I have a dataset comprising several data streams measured on >10k objects. The data are essentially time series sampled at 0.5-second intervals, with each object's record lasting ~500 seconds. Typically, an expert interpreter manually segments an object's record (normally into ~10 segments), sets a series of parameters for each segment based on the data in that segment, and then applies a series of equations that use the measured data and the parameters to calculate a set of properties of that object. This manual interpretation often takes a skilled interpreter 4 or more hours per object.
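To make the setup concrete, here is a rough sketch of the shape of one object's record and the manual workflow (all names and numbers are placeholders, not my real schema):

```python
import numpy as np

rng = np.random.default_rng(0)

# One object: several streams sampled every 0.5 s for ~500 s -> ~1000 samples.
n_samples, n_streams = 1000, 4
streams = rng.normal(size=(n_samples, n_streams))  # stand-in for the measured data

# The interpreter splits the record into ~10 contiguous segments...
segment_bounds = [0, 90, 210, 330, 420, 510, 640, 720, 810, 900, 1000]

# ...and chooses a small set of parameters for each segment by judgment.
# (Placeholder: 3 parameters per segment.)
params_per_segment = rng.uniform(size=(len(segment_bounds) - 1, 3))

def properties_from_segment(data, params):
    """Stand-in for the fixed physics equations that map
    (measured data, interpreter-chosen parameters) -> object properties."""
    return data.mean(axis=0) * params.sum()

properties = [
    properties_from_segment(streams[a:b], p)
    for (a, b), p in zip(
        zip(segment_bounds[:-1], segment_bounds[1:]), params_per_segment
    )
]
```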
Machine learning seems like a promising approach, either to:

- predict the resulting properties directly, or
- predict the parameters for each segment (this approach is actually more desirable, as it offers more transparency for downstream users); the two formulations are contrasted in the sketch below.
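In interface terms, the two formulations look like this (a sketch with placeholder names; `physics_equations` stands for the known, trusted equations mentioned above):

```python
# Option 1: end-to-end -- learn f(streams) -> properties directly.
def predict_properties_direct(model, streams):
    return model(streams)  # opaque to downstream users

# Option 2: learn g(streams) -> per-segment parameters, then run the
# known equations; downstream users can inspect the parameters.
def predict_properties_via_params(model, streams, physics_equations):
    theta = model(streams)                    # inspectable intermediate
    return physics_equations(streams, theta)  # deterministic, auditable
```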
The issue is that "labeling" (i.e., interpreting) one object takes ~4 hours yet yields only ~10 training data points per parameter (one per segment). Because interpretations are considered proprietary in this industry, there is no ready-made database to draw on, and building one of, say, 500 interpreted objects would take several months of work, which is not presently feasible.
My questions are:
- When labeling costs are very high, is there a standard approach to building a reference database?
- Are there other approaches that may make more sense when dealing with manual interpretations? E.g., some sort of self-play mechanism, or human feedback (I can often tell very quickly whether a result is really bad); a rough sketch of the feedback loop I have in mind follows this list.
- As this is a physics-based system, could physics-informed neural networks (PINNs) somehow reduce the amount of labeled data required? (Also sketched below.)
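On the second question, roughly what I have in mind for human feedback is a loop where the model proposes parameters, the fixed equations produce properties, and I give a fast pass/fail judgment that costs seconds rather than hours (a sketch with placeholder functions, not code against any particular library):

```python
def feedback_loop(model, objects, quick_expert_check, physics_equations):
    """Collect cheap binary feedback: each proposal costs seconds of expert
    time instead of ~4 hours, so the "is this obviously wrong?" signal
    accumulates far faster than full interpretations do."""
    accepted, rejected = [], []
    for streams in objects:
        theta = model(streams)                     # proposed parameters
        props = physics_equations(streams, theta)  # derived properties
        if quick_expert_check(props):              # fast pass/fail from me
            accepted.append((streams, theta))
        else:
            rejected.append((streams, theta))
    return accepted, rejected  # e.g., fine-tune on accepted, penalize rejected
```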
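And on the third question, what I mean by physics-informed is adding the governing equations as a soft constraint in the loss, so the network is penalized for physically inconsistent outputs even on unlabeled objects. A minimal PyTorch-style sketch, assuming a `physics_residual` function can be derived from the known equations:

```python
import torch

def pinn_loss(model, streams_labeled, theta_true, streams_unlabeled,
              physics_residual, lam=1.0):
    """Supervised loss on the few labeled objects, plus a physics term
    that any of the >10k unlabeled objects can contribute to."""
    # Data term: match the interpreter's parameters where labels exist.
    data_loss = torch.nn.functional.mse_loss(model(streams_labeled), theta_true)

    # Physics term: the residual of the governing equations should be ~0
    # for the predicted parameters; this needs no labels at all.
    theta_hat = model(streams_unlabeled)
    phys_loss = physics_residual(streams_unlabeled, theta_hat).pow(2).mean()

    return data_loss + lam * phys_loss
```

If that physics term really can substitute for labels, the >10k unlabeled objects become usable training signal rather than dead weight.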