
I have a dataset of 1600 rows and 28 columns. Only one column is partially complete, with 1300 records; the rest are NaN. I did a value count of this column, and it has 84 different nominal categories. What is the best way to impute this column? I need to convert the categories to numbers, impute the missing values, and then convert back. I understand that one-hot encoding does not work in this case because of the high cardinality.

What is the best way to approach this problem?

Carlos Mougan

2 Answers


I will make a biased suggestion: reading this paper might provide some insights:

"Quantile encoder: tackling high cardinality categorical features in regression problems"

Even though there are many methods, many of them implemented in the category encoders package (https://contrib.scikit-learn.org/category_encoders/), this paper can serve as a good basis for understanding them.
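To give a rough feel for the idea, here is a minimal plain-Python sketch of quantile-style encoding: each category is replaced by the median of the target within that category, shrunk toward the global median. It assumes you have a numeric target column to encode against. The function name `quantile_encode` and the smoothing weight `m` are hypothetical names chosen for illustration, and this is not the package's implementation.

```python
from collections import defaultdict
from statistics import median


def quantile_encode(categories, targets, m=1.0):
    """Map each category to a shrunken median of the target.

    `m` is a hypothetical smoothing weight: larger values pull rare
    categories harder toward the global median.
    """
    by_cat = defaultdict(list)
    for cat, y in zip(categories, targets):
        if cat is None:  # skip rows where the category is missing
            continue
        by_cat[cat].append(y)

    global_median = median(targets)
    encoding = {}
    for cat, ys in by_cat.items():
        n = len(ys)
        # Weighted blend of the per-category median and the global median.
        encoding[cat] = (n * median(ys) + m * global_median) / (n + m)
    return encoding


cats = ["a", "a", "a", "b", "b", "c"]
y = [10.0, 12.0, 14.0, 20.0, 22.0, 100.0]
enc = quantile_encode(cats, y, m=1.0)
```

With 84 categories and 1300 rows, some categories will be rare, which is why the shrinkage term matters: a category seen once should not be encoded by its single target value alone.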

In case you are dealing with socially sensitive data, you might also want to have a look at the paper "Fairness implications of encoding protected categorical attributes".

I hope it helps :)

Carlos Mougan

You may want to look into Target Encoding as an example:

https://contrib.scikit-learn.org/category_encoders/targetencoder.html
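As a rough sketch of what target encoding does (assuming you have a numeric or binary target to encode against), each category is replaced by a blend of its own target mean and the overall mean, with the blend weight growing as the category gets more rows. The function `target_encode` and the parameters `min_samples` and `smoothing` below are hypothetical names for illustration, written in plain Python; the library's own implementation may differ in its details.

```python
import math
from collections import defaultdict


def target_encode(categories, targets, min_samples=1, smoothing=1.0):
    """Map each category to a smoothed mean of the target."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1

    prior = sum(targets) / len(targets)  # overall target mean
    encoding = {}
    for cat, n in counts.items():
        # Sigmoid weight: close to 1 for frequent categories,
        # lower for categories with few rows.
        weight = 1.0 / (1.0 + math.exp(-(n - min_samples) / smoothing))
        encoding[cat] = weight * (sums[cat] / n) + (1.0 - weight) * prior
    return encoding


cats = ["red", "red", "red", "blue"]
y = [1, 1, 0, 0]
enc = target_encode(cats, y)
encoded_column = [enc[c] for c in cats]
```

Note that the encoding should be fitted on training rows only, otherwise the target leaks into the feature.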

Another post from the forums:

Problem with converting string to dummy variables

Brandon Donehoo