I have a dataset that looks like this:
| ID | var_id_invariant_1 | ... | var_id_invariant_p | var_id_variant_1 | ... | var_id_variant_k | target |
|---|---|---|---|---|---|---|---|
| 315 | 25 | ... | a | 2.4 | ... | A | 1 |
| 246 | 31 | ... | nan | 5.7 | ... | B | 0 |
| 315 | 25 | ... | a | 9.4 | ... | nan | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 153 | 75 | ... | d | 1.2 | ... | A | 0 |
In summary:
- ID identifies an entry; an entry can have multiple services, and it is split into one row per service.
- Some variables are ID-variant, meaning they can take different values for the same ID; others are ID-invariant.
- There are missing values.
- The target is ID-invariant.
- There are numerical and categorical variables.
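For concreteness, here is a toy version of this frame in pandas (column names and values are taken from the table above; the real data has many more columns):

```python
import numpy as np
import pandas as pd

# One row per (ID, service): ID-invariant columns repeat within an ID,
# ID-variant columns can differ from row to row.
df = pd.DataFrame({
    "ID":                 [315, 246, 315, 153],
    "var_id_invariant_1": [25, 31, 25, 75],
    "var_id_invariant_p": ["a", np.nan, "a", "d"],
    "var_id_variant_1":   [2.4, 5.7, 9.4, 1.2],
    "var_id_variant_k":   ["A", "B", np.nan, "A"],
    "target":             [1, 0, 1, 0],
})
```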
I want to build a predictive model that predicts the target for an entire ID, so I need to aggregate the rows by ID. My two main issues are the ID-variant variables and the missing values.
Imputing the missing values before aggregation (and therefore before the train/test split) would cause data leakage. I'm thinking about performing the imputation after the train/test split, but I'm not sure about this. What is a good way to perform imputation for predictive modeling when the data also needs to be aggregated? A sketch of what I have in mind is below.
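Here is a minimal sketch of that idea with scikit-learn, assuming the toy `df` from above; `GroupShuffleSplit` keeps all rows of an ID on the same side of the split, and the imputer only ever sees training rows:

```python
from sklearn.model_selection import GroupShuffleSplit
from sklearn.impute import SimpleImputer

# Split at the ID level so rows of one entry never straddle the split.
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(gss.split(df, groups=df["ID"]))
train, test = df.iloc[train_idx].copy(), df.iloc[test_idx].copy()

# Fit the imputer on training rows only, then apply it to both sets,
# so no test-set statistics leak into the imputed values.
num_cols = ["var_id_variant_1"]
imputer = SimpleImputer(strategy="median")
train[num_cols] = imputer.fit_transform(train[num_cols])
test[num_cols] = imputer.transform(test[num_cols])
```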
Is there a way to aggregate many ID-variant variables into a single row? This seems quite problematic, because values can repeat across rows. Maybe I could perform an operation like one-hot encoding, but with counts instead? These variables are tied to a specific service, so ideally the aggregation would preserve the information about which values occurred together on the same row; a sketch of the count idea follows.
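To illustrate the count idea on the toy frame (note that this particular encoding keeps per-category counts for each ID but drops the pairing between values that appeared on the same row):

```python
# "One-hot with counts": for each ID, count how often each category occurs
# in an ID-variant categorical column.
cat_counts = (
    df.groupby("ID")["var_id_variant_k"]
      .value_counts()
      .unstack(fill_value=0)
      .add_prefix("var_id_variant_k_")
)

# Numeric ID-variant columns can be summarized with a few statistics instead.
num_aggs = df.groupby("ID")["var_id_variant_1"].agg(["mean", "min", "max"])

# One row per ID, ready to join back onto the ID-invariant columns and target.
# IDs whose categorical values were all missing get zero counts via fillna.
aggregated = num_aggs.join(cat_counts).fillna(0)
```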