
I have a dataset comprising several data streams measured on objects (>10k objects). The data is essentially time series sampled at 0.5-second intervals. Typically, an expert interpreter manually segments the data (~500 seconds in duration), sets a series of parameters for each segment based on the data from that segment, and then applies a series of equations that use the measured data and the parameters to calculate properties of that object. The data from one object is normally split into ~10 segments, and the manual interpretation often takes a skilled interpreter 4 or more hours.

Machine learning seems like a promising approach to either:

  1. predict the resulting properties directly

  2. predict the parameters for each segment (this approach is actually more desirable as it offers more transparency for downstream users).

The issue is that it takes 4 hours to "label" (i.e., interpret) a single object, which yields only ~10 usable training points for any given parameter. As interpretations are considered proprietary in the industry, there is no ready database that can be used. Building up a database of, say, 500 interpreted objects would take several months of work, which is not presently feasible.

My questions are:

  • When labeling costs are very high is there a standard approach to building a reference database?
  • Are there other approaches to dealing with manual interpretations that may make more sense? E.g., using some sort of self-play mechanism or human feedback (I can often tell very quickly whether a result is really bad).
  • As this is a physics-based system, could using physics-informed neural nets somehow reduce the need for data volume?

1 Answer


Creating labels is expensive and time-consuming in many domains, so several families of methods have been developed for exactly this situation.

In the following, I give a brief overview of some techniques that might work in your case. In the end, as always, what works depends on your data and use case. For this reason, I will not discuss concrete methods, but name promising directions (some of the approaches overlap):

Semi-Supervised Learning

Semi-supervised learning approaches combine unlabeled and labeled data for prediction tasks. As a result, fewer labeled samples are required, and the large pool of unlabeled data is still put to use.
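
As an illustration, here is a minimal self-training sketch (one common semi-supervised scheme) with synthetic stand-in data: fit on the few labeled segments, then pseudo-label the unlabeled segments on which an ensemble of trees agrees most. All names, shapes, and data here are assumptions, not your actual features.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_lab = rng.normal(size=(100, 8))    # ~10 objects x 10 segments, 8 features each
y_lab = 2.0 * X_lab[:, 0] + rng.normal(scale=0.1, size=100)  # a segment parameter
X_unlab = rng.normal(size=(5000, 8))  # segments from uninterpreted objects

model = RandomForestRegressor(n_estimators=200, random_state=0)
for _ in range(3):  # a few self-training rounds
    model.fit(X_lab, y_lab)
    # Spread across the forest's trees serves as a crude confidence measure.
    per_tree = np.stack([t.predict(X_unlab) for t in model.estimators_])
    mean, std = per_tree.mean(axis=0), per_tree.std(axis=0)
    keep = std < np.quantile(std, 0.1)  # pseudo-label the most confident 10%
    X_lab = np.vstack([X_lab, X_unlab[keep]])
    y_lab = np.concatenate([y_lab, mean[keep]])
    X_unlab = X_unlab[~keep]
```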

Few-Shot Learning

There is a variety of algorithms that aim to learn from few examples. Typically, some kind of "general knowledge" must already be present. Many approaches deal with visual tasks, where the "general knowledge" might be the ability to distinguish shapes; a new shape can then be learned quickly. Nevertheless, there may also be suitable methods for your case.
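
For intuition, here is a hedged sketch in the spirit of Prototypical Networks: a query is assigned to the class whose mean embedding ("prototype") is nearest. The embed function is a hypothetical stand-in for a pretrained encoder; for your parameter-regression setting, a regression analogue would be needed.

```python
import numpy as np

def embed(x):
    # Hypothetical placeholder: in practice, a pretrained encoder mapping
    # raw segments into a feature space.
    return x

def predict(support_x, support_y, query_x):
    """Assign each query to the class with the nearest prototype (mean embedding)."""
    classes = np.unique(support_y)
    protos = np.stack([embed(support_x[support_y == c]).mean(axis=0)
                       for c in classes])
    dists = np.linalg.norm(embed(query_x)[:, None, :] - protos[None, :, :], axis=-1)
    return classes[dists.argmin(axis=1)]

# Tiny usage example: 2 classes, 3 support samples each, 2 queries near class 1.
rng = np.random.default_rng(0)
support_x = np.vstack([rng.normal(0, 1, (3, 4)), rng.normal(5, 1, (3, 4))])
support_y = np.array([0, 0, 0, 1, 1, 1])
print(predict(support_x, support_y, rng.normal(5, 1, (2, 4))))  # -> [1 1]
```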

Pre-Trained Models

Pre-trained models are trained for a general use case (e.g., predictions on time series). Subsequent fine-tuning for a specific use case then requires significantly fewer samples and less time than training from scratch.
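
A minimal fine-tuning sketch in PyTorch, with a plain GRU standing in for a pretrained time-series encoder (loading a real pretrained model is omitted): freeze the encoder and train only a small head on the few labeled segments. All shapes are assumptions.

```python
import torch
import torch.nn as nn

# Stand-in encoder; in practice, load a pretrained time-series model here.
encoder = nn.GRU(input_size=4, hidden_size=64, batch_first=True)
for p in encoder.parameters():
    p.requires_grad = False  # freeze the pretrained "general knowledge"

head = nn.Linear(64, 3)  # small task-specific head: 3 segment parameters
opt = torch.optim.Adam(head.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

x = torch.randn(100, 50, 4)  # toy data: 100 segments, 50 timesteps, 4 streams
y = torch.randn(100, 3)
for _ in range(200):
    _, h = encoder(x)                 # h: (num_layers, batch, hidden)
    loss = loss_fn(head(h[-1]), y)    # gradients flow into the head only
    opt.zero_grad()
    loss.backward()
    opt.step()
```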

Active Learning

Active learning is a way to iteratively label data. The active learning algorithm selects the next sample for labeling such that it promises the largest improvement in predictive power (or the largest reduction of the loss). This could be interesting in your case, since you can choose which objects to give to an expert for labeling.
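
A minimal uncertainty-sampling sketch with synthetic data: rank the uninterpreted objects by model disagreement and hand the most uncertain ones to the expert first. Per-tree spread in a random forest is used as a cheap uncertainty proxy; names and shapes are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_lab = rng.normal(size=(50, 8))               # objects interpreted so far
y_lab = 2.0 * X_lab[:, 0] + rng.normal(scale=0.1, size=50)
X_pool = rng.normal(size=(10_000, 8))          # uninterpreted objects

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_lab, y_lab)
per_tree = np.stack([t.predict(X_pool) for t in model.estimators_])
uncertainty = per_tree.std(axis=0)             # disagreement among trees
query_idx = np.argsort(uncertainty)[-5:]       # next 5 objects for the expert
```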

Manual Feature Engineering

One can reduce the input space by manually designing meaningful features. A smaller feature space typically requires less training data.
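
For example, each segment could be collapsed into a handful of summary statistics before modeling. A sketch with assumed shapes (a ~50 s segment at 0.5 s sampling, 4 streams):

```python
import numpy as np

def segment_features(seg):
    """Collapse one segment (timesteps x streams) into a few summary features."""
    t = np.arange(len(seg))
    slopes = np.polyfit(t, seg, 1)[0]  # linear trend of each stream
    return np.concatenate([seg.mean(axis=0), seg.std(axis=0), slopes])

seg = np.random.default_rng(0).normal(size=(100, 4))  # 100 timesteps x 4 streams
x = segment_features(seg)  # 12 features instead of 400 raw values
```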
