4

I have a dataset (10 million rows, 55 columns) with many missing values. I need to predict those values somehow using the other, non-missing values, i.e. replace them with something that is not NaN. Mean and median are not the solution here.

I tried to research other methods, but none of them work since I have many categorical variables. I also tried one-hot encoding to convert the categorical variables to integers, but I am not sure that is a solution in my case, since from only one categorical column I would get 600 new columns. If I did the same with the other categorical columns, I would get many millions of new columns. One of the categorical columns is a URL string and it is different for every row, so I have 10 million different URLs, for example.

Another categorical column is a description, which is also different for every row. I could probably remove the URL column, but I can't remove description, title, location and others. I tried PCA, but it also doesn't work with categorical data. I have missing data for both categorical and integer/float values. Would the get_dummies method be a good approach here? For imputing the missing values I tried KNN and maximum likelihood, but I get errors because of the categorical variables. The data is missing completely at random.

Do you have any suggestions on how to approach this problem, and which packages I should use?

user32550
  • 41
  • 1
  • 2

3 Answers

2

Like @El Burro suggested, I believe you should focus mainly on feature transformation, using different techniques for different features. For straightforward features such as occupation or gender, use one-hot encoding. For others, you can use some kind of hierarchical mapping/clustering: map values to groups defined by you. For example, if those URLs link to products, make 30 groups of similar product types, map the URLs to these groups, and then one-hot encode the mapped feature.
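A rough sketch of that mapping idea in pandas (the column names, URLs, and groups here are purely illustrative, not from your data):

```python
import pandas as pd

# Hypothetical frame: a high-cardinality "url" column and a low-cardinality "gender" column.
df = pd.DataFrame({
    "url": ["shop.example.com/shoes/1", "shop.example.com/books/7", "shop.example.com/shoes/9"],
    "gender": ["F", "M", None],
})

# Map each URL to a coarse, hand-defined product group instead of encoding it directly.
def url_to_group(url):
    if url is None:
        return "unknown"
    if "/shoes/" in url:
        return "shoes"
    if "/books/" in url:
        return "books"
    return "other"

df["url_group"] = df["url"].map(url_to_group)

# One-hot encode only the low-cardinality columns; the raw "url" column is dropped.
encoded = pd.get_dummies(df.drop(columns=["url"]), columns=["gender", "url_group"], dummy_na=True)
print(encoded)
```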

For the textual features you mentioned, I'm pretty confident you could drop some of them. If you really want to exploit information from the text, you could start with tf-idf feature extraction to generate features for each text snippet, or consider topic modeling (take a look at gensim for Python if interested) and represent each text as a mixture of topics.
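For the tf-idf option, a minimal sketch with scikit-learn (the descriptions are made up; for 10M rows you would cap max_features and keep the result as a sparse matrix):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical descriptions; in practice this would be the "description" column.
descriptions = [
    "red running shoes with breathable mesh",
    "hardcover fantasy novel, first edition",
    "lightweight trail running shoes",
]

# Cap the vocabulary so the resulting matrix stays manageable.
vectorizer = TfidfVectorizer(max_features=200, stop_words="english")
tfidf = vectorizer.fit_transform(descriptions)   # sparse matrix, shape (n_rows, n_terms)
print(tfidf.shape)
print(vectorizer.get_feature_names_out()[:10])
```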

Bogas
  • 596
  • 2
  • 8
1

I have the impression the question was more about how to impute mixed (categorical and continuous) data than about reducing the number of features (although that is an implied problem), so I might give it another try:

Since you do not want to use a baseline imputation like mean or median, you will have to fit different models for the imputation of different columns in your data.

There are some ML models that support mixed input feature types, for example XGBoost (XGB). XGB also comes with the benefit of being fast to train and offering a powerful, easily set up generalisation method to prevent overfitting (early-stopping the training when booster performance degrades on a validation holdout).

Also, XGB does not require feature normalisation and handles correlated features as well as imbalanced data well, without preprocessing/transformation help. It can also handle missing data (NaN) in the predictors during training or inference (although it will not perform very well on samples whose specific NaN pattern wasn't part of the training data).

So, basically, to impute column c of your data set X, you would train an XGBClassifier or XGBRegressor (depending on the type of c) to predict c, given X. Depending on the severity and the pattern of the missingness in your data, you might want to pick a complete-data portion (all features present) for validation, or, if none is available, rely on XGBoost's implicit handling of missing values.

Now you can simply train one imputation model per column. (A test data hold-out would have to be imputed with a model trained on the training portion only.)
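A minimal sketch of such a per-column imputer, assuming the other columns have already been numerically encoded (the helper name and the "price" example are hypothetical):

```python
import pandas as pd
from xgboost import XGBRegressor  # use XGBClassifier instead for a categorical target column

def impute_column(df: pd.DataFrame, target_col: str) -> pd.DataFrame:
    """Fill NaNs in target_col by training an XGB model on the rows where it is known."""
    predictors = df.drop(columns=[target_col])
    known = df[target_col].notna()

    model = XGBRegressor(n_estimators=200, learning_rate=0.1)
    # XGBoost tolerates NaN in the *predictors*, so no extra cleaning is needed here.
    model.fit(predictors[known], df.loc[known, target_col])

    out = df.copy()
    out.loc[~known, target_col] = model.predict(predictors[~known])
    return out

# e.g. df_encoded = impute_column(df_encoded, "price")
```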

There are several ways of enhancing this process or implementing it in a more sophisticated way: for example, start with the feature that has the fewest missing values, then, after training a model to predict those, use the completed/imputed feature as a predictor for the next column's model, and so on. This, I think, goes under the name iterative imputation in the literature.
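A rough sketch of that ordering, reusing the per-column helper sketched above (again assuming an already-encoded DataFrame):

```python
# Sort columns so the one with the fewest NaNs is imputed first; each imputed
# column then becomes a complete predictor for the next model in the chain.
impute_order = df.isna().sum().sort_values().index.tolist()
for col in impute_order:
    if df[col].isna().any():
        df = impute_column(df, col)   # the hypothetical per-column helper from above
```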

Then there are more intricate, readily packaged frameworks, like MICE or missForest, which wrap and interpret the iterative imputation task in different ways and with different base learners. (At least missForest, I think, can handle mixed data.)
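If you stay with numeric (already encoded) features, scikit-learn also ships a MICE-style implementation, IterativeImputer. A minimal sketch, with a random-forest base learner to make it roughly missForest-like (the toy matrix is made up):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401  (activates the estimator)
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

# Toy numeric matrix with NaNs; real data would be the encoded feature frame.
X = np.array([[1.0, 2.0, np.nan],
              [3.0, np.nan, 6.0],
              [5.0, 8.0, 9.0],
              [np.nan, 10.0, 12.0]])

# The default base estimator (BayesianRidge) is closer to classic MICE;
# swapping in a random forest mimics missForest.
imputer = IterativeImputer(estimator=RandomForestRegressor(n_estimators=50),
                           max_iter=10, random_state=0)
X_imputed = imputer.fit_transform(X)
print(X_imputed)
```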

petr
  • 176
  • 4
0

I had the same problem in one of the datasets I was using, and the answer is to focus more on feature transformation. If you simply include all the features of your dataset for encoding, you would probably end up with more columns than rows!

I am optimistic there are many features in your dataset that can be grouped into one column, and some features can be dropped because of multicollinearity (but I would suggest you double-check using other feature selection techniques too).

  1. Mainly try to reduce the number of features (feature transformation) without dropping any of them.
  2. Try to impute NaN values and then check for multicollinearity among the numerical features.
  3. If a feature contains more than 95% NaN values, it would be wise to drop it, as imputation can worsen performance.
  4. The same goes for variance: if more than 95% of a feature's values are the same, drop it (for example, a fuel feature with 98 values of petrol and 2 of diesel won't contribute much). Check the VarianceThreshold documentation in scikit-learn (see the sketch just after this list).
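A quick sketch of that variance cut-off with scikit-learn's VarianceThreshold (the fuel columns are a made-up one-hot example):

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical one-hot encoded frame; the fuel columns are almost constant.
X = pd.DataFrame({
    "fuel_petrol": [1] * 98 + [0] * 2,
    "fuel_diesel": [0] * 98 + [1] * 2,
    "mileage":     list(range(100)),
})

# For a 0/1 feature, Var = p(1 - p), so this threshold drops features where
# one value covers more than 95% of the rows.
selector = VarianceThreshold(threshold=0.95 * (1 - 0.95))
X_reduced = selector.fit_transform(X)
print(selector.get_feature_names_out())   # only the surviving columns remain
```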

There are many more techniques, and I am sure doing all of the above will reduce your number of features and, hence, your dimensionality.

Mario
  • 571
  • 1
  • 6
  • 24
spectre
  • 2,223
  • 2
  • 14
  • 37