
Possibly similar question: Is it ok to collect data using algorithm to train another?

I have a model that accurately describes an underlying physical, complex, system. The model is basically a set of ODEs based on the physics of the system, validated against measurements. When a system perturbation occurs, I can run thousands of simulations to assess if the new system status is secure or not. This is basically a classification procedure (yes/no).

This procedure is very time consuming and has to be performed in real-time (thus requiring huge computational resources). There are thousands of possible perturbations and an infinite number of initial points. The same perturbation from one initial point can lead to a stable system, while from another, to an unstable one.

My question is:

Is it possible to use data generated by a huge number of simulations to train a classification algorithm to perform this detection online? What are the considerations when using simulated data to train an algorithm that will then be used online with real data (except for the obvious point that the simulation needs to be very accurate)? Any references to such examples?

I apologise if this is a basic question. I am new to data science techniques with a more physics/engineering background.

apetros85

1 Answer


Is it possible to use data generated by a huge number of simulations to train a classification algorithm to perform this detection online?

Yes, it is always possible to train a classification algorithm when you have labeled i.i.d. training data, and there is no hard reason why you cannot use a simulator to generate that.

Whether or not such a trained model is fit for purpose is hard to say in advance of trying it.

Using a simulation as your data source has some benefits:

  • Generating more training and test data is straightforward.

  • You will automatically have high quality ground truth labels (assuming your goal is to match the simulation).

  • If you find a problem with certain parameter values, you can target them when collecting more training data.

Just as with data taken from real world measurements, you will need to test your results to get a sense of how accurate your model is.
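As a minimal sketch of the workflow (the toy `simulate_is_stable` function is a hypothetical stand-in for your full ODE simulator, and the feature dimensions and classifier choice are illustrative), you can generate labelled data offline and fit an off-the-shelf classifier, then use a held-out test set to measure how closely it reproduces the simulator's yes/no decision:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Hypothetical stand-in for the expensive ODE simulation: each row
# is an initial operating point, and the label says whether the
# perturbed system remains secure (1) or not (0).
def simulate_is_stable(X):
    return (X[:, 0] ** 2 + 0.5 * X[:, 1] < 1.0).astype(int)

# Generate labelled training data by sampling the parameter space
# and running the (offline, slow) simulation once per sample.
X = rng.uniform(-2.0, 2.0, size=(5000, 2))
y = simulate_is_stable(X)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# The held-out test accuracy estimates how well the cheap online
# classifier matches the simulator's decisions.
print(accuracy_score(y_test, clf.predict(X_test)))
```

Once trained, `clf.predict` runs in milliseconds per query, which is the entire point of substituting it for the real-time simulation.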

What are the considerations when using simulated data to train an algorithm that will then be used online with real data (except for the obvious point that the simulation needs to be very accurate)?

  • Your model is a function approximator. At best it will match the output of the simulator; in practice it will usually fall short of it by some amount. You will have to measure this difference by testing the model, and decide whether the cost of occasional false negatives or false positives is outweighed by the performance improvement.

  • Statistical machine learning models perform best when interpolating between data points, and often perform badly when extrapolating. So when you say that inputs can vary infinitely, hopefully that is within some constrained parameter space of real values, as opposed to inputs completely different from anything you have considered before. The simulation would cope with such inputs, but a statistics-based function approximator most likely would not.

  • If your simulation has areas where the class switches rapidly with small changes in parameter values, then you will need to sample densely in those areas.

  • If your simulation produces near-chaotic behaviour in any region (the class value varies a lot and is highly sensitive to small changes in one or more parameters), then that behaviour will be very hard to approximate.

  • If you have some natural scale factor, dimensionless number or other easy-to-compute summary of behaviour in your physical system, it may be worth using it as an engineered feature instead of making the machine learning code figure it out statistically. For instance, in fluid dynamics, the Reynolds number characterises flow regime, and could be a useful feature for a neural network predicting vortex shedding.
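To illustrate the last point with the Reynolds number example (the variable names and values below are purely illustrative, not taken from the question), an engineered dimensionless feature is simply appended to the raw physical inputs before training:

```python
import numpy as np

# Hypothetical raw inputs for a flow problem:
# density rho [kg/m^3], velocity u [m/s], length L [m], viscosity mu [Pa*s].
def with_reynolds(rho, u, length, mu):
    # Reynolds number Re = rho * u * L / mu is a dimensionless summary
    # of the flow regime; appending it as a feature saves the model
    # from having to discover this ratio statistically.
    re = rho * u * length / mu
    return np.array([rho, u, length, mu, re])

# Water-like values give Re = 1e5 (turbulent regime).
features = with_reynolds(1000.0, 2.0, 0.05, 1.0e-3)
```

The same pattern applies to any cheap summary statistic your physics suggests: compute it once per sample and feed it to the learner alongside the raw inputs.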

Any references to such examples?

The examples I have found are in rendering of fluid simulations and other complex physical systems, where a full simulation can be approximated; they all use neural networks to achieve a speed improvement over running the full simulation.

However, I don't think any of these are classifiers.

Neil Slater