How to encode high cardinality categorical data?

Question

I have a dataset of 1600 rows and 28 columns. Only one column is partially complete, with 1300 records; the rest are NaN. A value count of this column shows 84 different nominal categories. What is the best way to impute this column? I need to convert the categories to numbers, impute the missing values, and then convert back. I understand that one-hot encoding does not work well in this case because of the high cardinality.

What is the best way to approach this problem?

score 1 · Answer 1 · answered Oct 24 '22 at 21:47

I will make a biased suggestion: reading this paper might provide some insights.

"Quantile encoder: tackling high cardinality categorical features in regression problems"

On arXiv: https://arxiv.org/abs/2105.13783
On Springer: https://link.springer.com/chapter/10.1007/978-3-030-85529-1_14

There are many encoding methods, and many of them are implemented in the category encoders package (https://contrib.scikit-learn.org/category_encoders/), but this paper can provide a good understanding of the problem.
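To give a rough idea of what quantile encoding does, here is a minimal sketch in plain Python. It replaces each category with the median of the target values seen for that category, shrunk towards the global median. The function names, the smoothing parameter `m`, the toy data, and the exact smoothing formula are my own assumptions for illustration; this is not the reference implementation from the paper or the package.

```python
from collections import defaultdict
from statistics import median


def fit_quantile_encoding(categories, targets, m=1.0):
    """Map each category to a smoothed median of its target values.

    `m` (hypothetical name) controls how strongly small categories
    are pulled towards the global median.
    """
    global_median = median(targets)
    groups = defaultdict(list)
    for cat, y in zip(categories, targets):
        if cat is not None:  # rows with a missing category are skipped
            groups[cat].append(y)
    mapping = {}
    for cat, ys in groups.items():
        n = len(ys)
        mapping[cat] = (n * median(ys) + m * global_median) / (n + m)
    return mapping, global_median


def transform(categories, mapping, default):
    """Encode categories; missing or unseen values get `default`."""
    return [mapping.get(cat, default) for cat in categories]


# Toy data, made up for illustration: one nominal column and a numeric target.
cities = ["A", "A", "A", "B", "B", None, "C"]
prices = [10.0, 12.0, 50.0, 20.0, 22.0, 30.0, 40.0]

mapping, fallback = fit_quantile_encoding(cities, prices, m=1.0)
encoded = transform(cities, mapping, fallback)
print(mapping)
print(encoded)
```

Note that this needs a numeric target column to encode against, and the mapping should be fitted on training rows only to avoid leaking the target.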

In case you are dealing with socially sensitive data, you might want to have a look at the paper "Fairness implications of encoding protected categorical attributes", which you can find at:

arXiv: https://arxiv.org/abs/2201.11358
Montreal AI Ethics Institute: https://montrealethics.ai/fairness-implications-of-encoding-protected-categorical-attributes/

I hope it helps :)

score 1 · Answer 2 · answered Sep 21 '20 at 21:54

You may want to look into Target Encoding as an example:

https://contrib.scikit-learn.org/category_encoders/targetencoder.html
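As a rough illustration of the idea, here is a minimal sketch of target encoding in plain Python. Each category is replaced by the mean of the target for that category, blended with the overall mean so that rare categories are not trusted too much. The function name, the `smoothing` parameter, and the toy data are assumptions made for this example; the linked TargetEncoder applies its own smoothing formula, so its values will generally differ.

```python
from collections import defaultdict


def fit_target_encoding(categories, targets, smoothing=2.0):
    """Map each category to a smoothed mean of the target.

    `smoothing` (hypothetical name) acts like a number of pseudo-rows
    whose target equals the overall mean.
    """
    prior = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        if cat is not None:  # rows with a missing category are skipped
            sums[cat] += y
            counts[cat] += 1
    mapping = {
        cat: (sums[cat] + smoothing * prior) / (counts[cat] + smoothing)
        for cat in counts
    }
    return mapping, prior


# Toy data, made up for illustration: 3 categories and a binary target.
colors = ["red", "red", "blue", "blue", "blue", "green"]
labels = [1, 0, 1, 1, 0, 0]

mapping, prior = fit_target_encoding(colors, labels, smoothing=2.0)

# Missing (None) or unseen categories fall back to the overall mean.
encoded = [mapping.get(c, prior) for c in colors + [None, "purple"]]
print(mapping)
print(encoded)
```

With 84 categories this yields a single numeric column instead of 84 dummy columns. Fit the mapping on training data only, since it uses the target.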

Another post from the forums:

Problem with converting string to dummy variables

Brandon Donehoo
