7

I work in physics. We have many experimental runs; each run yields a result y and some parameters x that should predict it. Over time, we have found more and more parameters worth recording, so our data look like the following:

Year 1 data: (2000 runs)
    parameters: x1,x2,x3                target: y
Year 2 data: (2000 runs)
    parameters: x1,x2,x3,x4,x5          target: y
Year 3 data: (2000 runs)
    parameters: x1,x2,x3,x4,x5,x6,x7    target: y

How does one build a regression model that incorporates the additional information we recorded, without throwing away what it "learned" about the older parameters?

Should I:

  • just set x4, x5, etc. to 0 or -1 when they're not available?
  • completely ignore x4,x5,x6,x7 and only use x1,x2,x3?
  • add another parameter that is simply the number of parameters?
  • train separate models for each year, and combine them somehow?
  • "weight" the parameters, so as to ignore them if I set the weight to 0?
  • make three different models, using x1,x2,x3, x4,x5, and x6,x7 parameters, and then interpolate somehow?
  • make a custom "imputer" to guesstimate the missing parameters from the available ones?

I have tried mean and median imputation, but neither works very well because the parameters are not independent; they are fairly strongly correlated with one another.

JoseOrtiz3
  • 172
  • 6

3 Answers

2

One simple idea, no imputation needed: build a model using the parameters that have always existed; then, each time a new set of parameters is added, use it to model the residual of the previous model. At prediction time, sum the contributions of all the models that apply to the data you happen to have. (If effects tend to multiply rather than add, you could do this in log space.)
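A minimal sketch of this residual-stacking idea, using synthetic stand-in data and scikit-learn (the column layout x1..x7 and the use of `LinearRegression` are my assumptions, not part of the answer):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic stand-ins for three years of runs, 2000 each, 7 parameters.
X_all = rng.normal(size=(6000, 7))
y = X_all @ rng.normal(size=7) + rng.normal(scale=0.1, size=6000)
year = np.repeat([1, 2, 3], 2000)

# Model 1: x1..x3, trained on every run (all years record these).
m1 = LinearRegression().fit(X_all[:, :3], y)
r1 = y - m1.predict(X_all[:, :3])

# Model 2: x4, x5 predict model 1's residual (years 2-3 only).
has45 = year >= 2
m2 = LinearRegression().fit(X_all[has45][:, 3:5], r1[has45])
r2 = r1[has45] - m2.predict(X_all[has45][:, 3:5])

# Model 3: x6, x7 predict the remaining residual (year 3 only).
has67 = year[has45] == 3
m3 = LinearRegression().fit(X_all[has45][has67][:, 5:7], r2[has67])

def predict(x):
    """Sum the contributions of whichever models apply to a run."""
    x = np.atleast_2d(x)
    pred = m1.predict(x[:, :3])
    if x.shape[1] >= 5:
        pred = pred + m2.predict(x[:, 3:5])
    if x.shape[1] == 7:
        pred = pred + m3.predict(x[:, 5:7])
    return pred
```

Because the later models only ever see residuals, adding new parameters never disturbs what the earlier models learned, and a year-1 run with only three parameters still gets a sensible prediction from model 1 alone.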

Ken Arnold
  • 246
  • 1
  • 6
1

If the old variables and the new variables are highly correlated, then you could do a more advanced form of imputation: build a model for each new input that predicts it from the old inputs. Such a model would probably be quite good at predicting the new inputs because, as you said, the inputs are strongly correlated. Then split your data across the years so that you have equal proportions of old and new records in your training, validation, and test sets.
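A short sketch of this model-based imputation for one new input (the synthetic data, the choice of `LinearRegression`, and the name `x4` are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Old inputs x1..x3 exist in every run; x4 is correlated with them
# but was only recorded from year 2 onward.
X_old = rng.normal(size=(4000, 3))
x4 = X_old @ np.array([0.8, -0.5, 0.3]) + rng.normal(scale=0.1, size=4000)
recorded = np.zeros(4000, dtype=bool)
recorded[2000:] = True  # years 2-3

# Fit the imputer on the runs where x4 was actually recorded...
imp = LinearRegression().fit(X_old[recorded], x4[recorded])

# ...and fill in the year-1 runs with its predictions.
x4_filled = np.where(recorded, x4, imp.predict(X_old))
```

The same pattern repeats for x5, x6, and x7; once every run has a full parameter vector, a single downstream regression model can be trained on all years at once.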

Ryan Zotti
  • 4,209
  • 3
  • 21
  • 33
0

I would multiply-impute the values for x4, x5, x6, and x7. For the number of imputations, look at the whole dataset, compute the percentage of fields missing, and round up to the nearest integer. Don't use mean- or median-imputation; use PROC MI in SAS or the equivalent. Because your data are monotone-missing, you could likely use a MONOTONE statement. This is probably the most conservative approach, because excluding information--whether variables or observations--opens you up to bias.
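For readers without SAS, one possible equivalent (my suggestion, not the answer's) is scikit-learn's `IterativeImputer` with `sample_posterior=True`, which draws a plausible fill-in per run; repeating it with different seeds gives multiple imputed datasets. A sketch on synthetic data:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 4))
X[:, 3] = X[:, 0] * 0.9 + rng.normal(scale=0.1, size=300)
X[:100, 3] = np.nan  # monotone-missing: the older runs lack the new column

# The answer's rule of thumb: number of imputations = % of fields
# missing, rounded up.
frac_missing = np.isnan(X).mean()
m = int(np.ceil(100 * frac_missing))

# Each pass draws from the posterior, so the m completed datasets differ.
imputations = [
    IterativeImputer(sample_posterior=True, random_state=i).fit_transform(X)
    for i in range(m)
]
# Fit the downstream regression on each completed dataset, then pool
# the estimates (e.g. via Rubin's rules) as PROC MI / MIANALYZE would.
```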

The Baron
  • 1
  • 1