I work in physics. We have lots of experimental runs, with each run yielding a result, y, and some parameters that should predict the result, x. Over time, we have found more and more parameters to record. So our data looks like the following:
Year 1 data: (2000 runs)
parameters: x1,x2,x3 target: y
Year 2 data: (2000 runs)
parameters: x1,x2,x3,x4,x5 target: y
Year 3 data: (2000 runs)
parameters: x1,x2,x3,x4,x5,x6,x7 target: y
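To make the layout concrete, here is a minimal sketch (with fabricated stand-in numbers, not real run data) of how the three years can be stacked into one table, with NaN marking parameters that simply were not recorded in a given year:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

def make_year(n_runs, cols):
    # Fabricated placeholder values; real rows would come from the experiments.
    df = pd.DataFrame(rng.normal(size=(n_runs, len(cols))), columns=cols)
    df["y"] = rng.normal(size=n_runs)
    return df

year1 = make_year(2000, ["x1", "x2", "x3"])
year2 = make_year(2000, ["x1", "x2", "x3", "x4", "x5"])
year3 = make_year(2000, ["x1", "x2", "x3", "x4", "x5", "x6", "x7"])

# Concatenation aligns on column names; columns absent in a year become NaN,
# which imputation tools treat as "missing", unlike a sentinel 0 or -1.
data = pd.concat([year1, year2, year3], ignore_index=True)
print(data.shape)               # 6000 runs, 7 parameters + target
print(data["x6"].isna().sum())  # x6 was never recorded in years 1 and 2
```

Representing never-recorded values as NaN rather than 0/-1 keeps "missing" distinct from a genuinely measured value, which matters for several of the options below.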
How does one build a regression model that incorporates the additional information we recorded, without throwing away what it "learned" about the older parameters?
Should I:
- just set x4, x5, etc. to 0 or -1 when I'm not using them?
- completely ignore x4, x5, x6, x7 and only use x1, x2, x3?
- add another parameter that is simply the number of parameters?
- train separate models for each year, and combine them somehow?
- "weight" the parameters, so as to ignore them if I set the weight to 0?
- make three different models, using x1,x2,x3, x4,x5, and x6,x7 parameters, and then interpolate somehow?
- make a custom "imputer" to guesstimate the missing parameters (using the available parameters)?
I have tried imputation using the mean and the median, but neither works very well, because the parameters are not independent but rather fairly correlated.
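Since the parameters are correlated, one direction worth noting is model-based imputation: scikit-learn's IterativeImputer regresses each feature on the others, so a missing x4 is predicted from the recorded x1, x2, x3 instead of being filled with a constant. A minimal sketch on fabricated correlated data (the coefficients and noise level are invented for illustration):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)

# Fabricated correlated features: x4 and x5 are noisy functions of x1..x3.
n = 200
x123 = rng.normal(size=(n, 3))
x4 = x123 @ np.array([0.5, -0.3, 0.8]) + 0.1 * rng.normal(size=n)
x5 = x123 @ np.array([-0.2, 0.7, 0.4]) + 0.1 * rng.normal(size=n)
X = np.column_stack([x123, x4, x5])

# Hide x4, x5 for the first half of the runs, mimicking the Year-1 data.
X_missing = X.copy()
X_missing[: n // 2, 3:] = np.nan

imputer = IterativeImputer(random_state=0)
X_filled = imputer.fit_transform(X_missing)

# Because the features are correlated, the regression-based fill lands
# much closer to the hidden truth than a constant mean/median fill would.
err = np.abs(X_filled[: n // 2, 3:] - X[: n // 2, 3:]).mean()
print(err)
```

This is essentially the "custom imputer" option from the list, built from an off-the-shelf component; it only helps to the extent that the correlations seen in the newer years carry over to the older runs.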