
I have a dataset that looks like this:

ID   var_id_invariant_1  ...  var_id_invariant_p  var_id_variant_1  ...  var_id_variant_k  target
315  25                  ...  a                   2.4               ...  A                 1
246  31                  ...  nan                 5.7               ...  B                 0
315  25                  ...  a                   9.4               ...  nan               1
...  ...                 ...  ...                 ...               ...  ...               ...
153  75                  ...  d                   1.2               ...  A                 0

In summary:

  1. ID identifies an entry that can have multiple services; the entry is split into separate rows, one per service.
  2. Some variables are ID variant, meaning they can have different values for the same ID. Some are not.
  3. There are missing values.
  4. Target is ID invariant.
  5. There are numerical and categorical variables.

I want to build a predictive model that predicts the target for an entire ID, so I want to aggregate the rows by ID. My two main issues are the ID variant variables and the missing values.

If I were to impute the missing values before aggregation, that would cause data leakage. I'm thinking about performing the imputation after the train/test split, but I'm not sure about this. What is a good way of performing imputation for predictive modeling when the data also needs to be aggregated?

Is there a way of aggregating many ID variant variables into a single row? This seems quite problematic, because some values can repeat. Maybe I could perform an operation like one-hot encoding, but with counts instead? These variables are related to a specific service, so ideally the information about which values were associated with which service would be preserved after the aggregation.
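For example, something along these lines in pandas is what I have in mind (the column names and values are just placeholders mirroring the table above):

```python
import pandas as pd

# Hypothetical frame mirroring the layout in the question.
df = pd.DataFrame({
    "ID":               [315, 246, 315, 153],
    "var_id_variant_k": ["A", "B", None, "A"],   # ID variant categorical column
})

# One-hot encode the ID variant categorical column, then sum within each ID,
# so every ID ends up on a single row of per-category counts.
counts = (
    pd.get_dummies(df["var_id_variant_k"], prefix="var_id_variant_k")
      .groupby(df["ID"])
      .sum()
)
print(counts)
```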

Mateusz

1 Answer


This sounds like you're having issues grappling with relational theory.

You have focused on the ID column as though it identifies an observed example. But your narrative ("multiple services") makes it clear that it does not. It sounds like you have a compound PK composed of (id, service).

Some variables are ID variant, meaning they can have different values for the same ID.

This is explaining that a given column may be a function of multiple variables: f(id, service).

Some are not.

This explains that some attributes only care about a projection of the PK: f(id).
One of those is the target attribute.


I want to aggregate the rows by ID.

I am hard pressed to see why that would be a good thing to want. From your problem description, it sounds like you should want to aggregate the rows by (id, service).

Or, put another way, you might want to aggregate by id, but only when the relevant subset of columns is of interest: the columns that depend only on a projection of the key.

If I were to impute the missing values before aggregation that would cause data leakage.

Do imputation using the (id, service) PK tuple. There's no data leakage, no cheating, since both are visible in each observed example.

You are free to do imputation before or after the train/test split, as it depends only on inputs and not on the target column Y.
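As a rough sketch of key-aware imputation in pandas, assuming illustrative column names (the fill strategies below are examples only; the right statistic depends on the variable):

```python
import pandas as pd
import numpy as np

# Hypothetical frame; column names are chosen for illustration only.
df = pd.DataFrame({
    "ID":                 [315, 246, 315, 153],
    "service":            ["s1", "s1", "s2", "s3"],
    "var_id_invariant_1": [25.0, 31.0, np.nan, 75.0],  # should be constant per ID
    "var_id_variant_1":   [2.4, 5.7, 9.4, np.nan],
})

# ID invariant columns: recover the missing value from another row of the same ID.
df["var_id_invariant_1"] = (
    df.groupby("ID")["var_id_invariant_1"].transform(lambda s: s.ffill().bfill())
)

# ID variant columns: here, a simple per-ID median as a fallback.
df["var_id_variant_1"] = (
    df.groupby("ID")["var_id_variant_1"].transform(lambda s: s.fillna(s.median()))
)
print(df)
```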

Is there a way of aggregating many ID variant variables into a single row?

Yes. In SQL we call it a GROUP BY, but pandas and many other technologies offer similar facilities.

... some values can repeat.

Yes. So we use aggregate functions, such as MIN() or AVG().
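For example, a minimal pandas sketch (column names and aggregate choices are illustrative):

```python
import pandas as pd

# Hypothetical frame with one numeric ID variant column and the ID invariant target.
df = pd.DataFrame({
    "ID":               [315, 246, 315, 153],
    "var_id_variant_1": [2.4, 5.7, 9.4, 1.2],
    "target":           [1, 0, 1, 0],
})

# SQL: SELECT id, MIN(x), AVG(x) FROM t GROUP BY id;
# pandas equivalent, using named aggregation:
agg = df.groupby("ID").agg(
    var_min=("var_id_variant_1", "min"),
    var_mean=("var_id_variant_1", "mean"),
    target=("target", "first"),   # safe because target is ID invariant
)
print(agg)
```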


Here's a thought experiment for you.

There are N services that appear in your input data. What would it look like if you did N modeling exercises, each focused on the data of a single service?
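A minimal sketch of that experiment, with placeholder data and a placeholder model choice:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Thought-experiment sketch: fit one model per service, using only that
# service's rows. Data, columns, and model are illustrative placeholders.
df = pd.DataFrame({
    "service": ["s1", "s1", "s1", "s2", "s2", "s2"],
    "x":       [2.4, 5.7, 9.4, 1.2, 3.3, 8.8],
    "target":  [1, 0, 1, 0, 1, 0],
})

models = {}
for service, rows in df.groupby("service"):
    models[service] = LogisticRegression().fit(rows[["x"]], rows["target"])

print(models)   # one fitted model per service
```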

J_H