7

I would like to train a binary classifier on feature vectors. One of the features is a categorical string feature: the ZIP codes of a country.

Typically there are thousands of ZIP codes, and in my case they are strings. How can I convert this feature into a numerical one?

I do not think one-hot encoding is a good solution in my case. Am I right about that? If so, what would be a suitable alternative?

Rami
  • 604
  • 2
  • 6
  • 16

3 Answers

8

This is an old question, but I am surprised that no one has mentioned Mean Encoding (a.k.a. Target Encoding). It is very popular in supervised learning problems. Besides that, I have seen people use the frequency, or the CDF of the frequency (to avoid the noise generated by a heavy-tailed PDF), and they achieved pretty good results with LightGBM. However, I cannot rigorously explain why it works.
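A minimal sketch of both ideas in pandas, on hypothetical toy data (column names and the smoothing constant are assumptions, not from the question):

```python
import pandas as pd

# Toy data: a ZIP-code feature and a binary target.
df = pd.DataFrame({
    "zip": ["10001", "10001", "94105", "94105", "94105", "60601"],
    "y":   [1,       0,       1,       1,       0,       0],
})

# Mean (target) encoding: replace each category by the mean of the
# target for that category, smoothed toward the global mean so that
# rare ZIP codes do not get extreme, noisy values.
global_mean = df["y"].mean()
stats = df.groupby("zip")["y"].agg(["mean", "count"])
alpha = 10.0  # smoothing strength; a tunable assumption
smoothed = (stats["count"] * stats["mean"] + alpha * global_mean) \
    / (stats["count"] + alpha)
df["zip_te"] = df["zip"].map(smoothed)

# Frequency encoding: replace each category by its relative frequency.
freq = df["zip"].value_counts(normalize=True)
df["zip_freq"] = df["zip"].map(freq)
```

In practice the target statistics should be computed on the training folds only (or with out-of-fold estimates) to avoid leaking the target into the feature.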

Diansheng
  • 181
  • 1
  • 4
7

One-hot-encoded ZIP codes shouldn't present a problem with modern tools, where feature vectors can be much wider (millions, even billions, of columns), but if you really want to shrink the feature you could aggregate ZIP codes into larger regions, such as states. Of course, you should not use the strings themselves, but sparse bit vectors. Two other dimensionality reduction options are MCA (PCA for categorical variables) and random projection.
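For example, with scikit-learn the one-hot encoding is returned as a SciPy sparse matrix by default, so even thousands of ZIP codes stay cheap in memory (the data below is made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# A single categorical column of ZIP-code strings.
zips = np.array([["10001"], ["94105"], ["60601"], ["10001"]])

# handle_unknown="ignore" maps unseen ZIP codes at prediction time
# to the all-zeros row instead of raising an error.
enc = OneHotEncoder(handle_unknown="ignore")
X = enc.fit_transform(zips)  # sparse matrix: one column per ZIP code
```

Only one entry per row is nonzero, so the storage cost grows with the number of samples, not with the number of distinct ZIP codes.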

Emre
  • 10,541
  • 1
  • 31
  • 39
3

You can use embeddings, as mentioned in the comments; see, e.g., a general blog post, or the Keras documentation for the embedding layer, which can be used to learn the embedding. This is widely used by deep learning models when you need to reduce the number of features, and it works for a single categorical feature as well.

eSadr
  • 131
  • 4