Apply a clustering algorithm on categorical data with features of multiple values

Question

Let's say I have data about people, such as gender, age, marital status, education, employment, and hobbies.

I want to cluster those people so that the members of each cluster have something in common (for example, a shared hobby, education, or age).

Here is a sample of my dataset:

I should use an algorithm that works with categorical data, like K-Prototypes, but I am not sure how to handle the hobbies specifically, because that feature may hold anywhere from 1 to N values per person.

score -1 · Answer 1 · answered Oct 02 '19 at 06:28

K-means clustering is based on distance. Whenever you are able to define the distance between two values of a categorical feature, it is theoretically possible, yet not always straightforward, to use the algorithm.

The basic idea I would recommend is to define a distance metric for each feature. This may not be easy, and you may need to set the distance matrices by hand. For example, for the Marital Status feature, assuming single is index 0, married is 1 and separated is 2, you could have the following matrix:

$$\begin{pmatrix} 0 & 0.8 & 0.3 \\ 0.8 & 0 & 0.5 \\ 0.3 & 0.5 & 0 \end{pmatrix}$$

If you can't define a relevant distance, you can just have it be 0 if both records have the same feature value, and 1 otherwise.
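As a minimal sketch of these two options, the hand-set matrix and the 0/1 fallback could be written as plain lookup functions. The function names and the string labels are hypothetical; the index order (single = 0, married = 1, separated = 2) is the one assumed in the text above.

```python
# Hand-set distances for Marital Status, copied from the matrix above.
# Assumed index order: single = 0, married = 1, separated = 2.
MARITAL_INDEX = {"single": 0, "married": 1, "separated": 2}
MARITAL_DISTANCE = [
    [0.0, 0.8, 0.3],
    [0.8, 0.0, 0.5],
    [0.3, 0.5, 0.0],
]


def marital_distance(a, b):
    """Look up the hand-set distance between two marital statuses."""
    return MARITAL_DISTANCE[MARITAL_INDEX[a]][MARITAL_INDEX[b]]


def mismatch_distance(a, b):
    """Fallback when no meaningful distance exists: 0 if equal, 1 otherwise."""
    return 0.0 if a == b else 1.0
```

For instance, `marital_distance("single", "married")` reads the 0.8 entry from the matrix, while `mismatch_distance("BSc", "MSc")` falls back to 1.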

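With a distance defined for every feature, you can compute the distance between any two records in your dataset, for example by summing the per-feature distances. From then on, a distance-based clustering algorithm can be applied as if all the data were numerical. Note that k-means itself also needs to average records to form centroids, which is not defined for categories, so in practice a variant that works from pairwise distances, such as k-medoids, is the closer fit.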
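The following is a rough sketch of that last step, not a definitive implementation. It sums per-feature distances into a record distance, builds the full pairwise matrix, and clusters on it. Two things here are my own assumptions: the records and feature names are made up for illustration, and the update step picks a medoid (an actual record) instead of a mean, since averaging categorical values is not defined. All helper names are hypothetical.

```python
def mismatch(a, b):
    """0 if the two values are equal, 1 otherwise."""
    return 0.0 if a == b else 1.0


def record_distance(r1, r2, feature_distances):
    """Sum the per-feature distances between two records (dicts)."""
    return sum(dist(r1[name], r2[name]) for name, dist in feature_distances.items())


def pairwise_distances(records, feature_distances):
    """Full symmetric distance matrix as a list of lists."""
    n = len(records)
    return [
        [record_distance(records[i], records[j], feature_distances) for j in range(n)]
        for i in range(n)
    ]


def assign(dist, medoids):
    """Label each record with the index of its nearest medoid."""
    return [
        min(range(len(medoids)), key=lambda c: dist[i][medoids[c]])
        for i in range(len(dist))
    ]


def k_medoids(dist, initial_medoids, max_iter=100):
    """Alternate assignment and medoid update on a precomputed distance matrix."""
    medoids = list(initial_medoids)
    for _ in range(max_iter):
        labels = assign(dist, medoids)
        new_medoids = []
        for c, old in enumerate(medoids):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:
                # Keep the previous medoid if a cluster ends up empty.
                new_medoids.append(old)
                continue
            # The new medoid is the member closest, in total, to its cluster.
            new_medoids.append(min(members, key=lambda m: sum(dist[m][j] for j in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return assign(dist, medoids), medoids


# Made-up records, for illustration only.
records = [
    {"gender": "F", "marital": "single", "education": "BSc"},
    {"gender": "F", "marital": "single", "education": "MSc"},
    {"gender": "M", "marital": "married", "education": "PhD"},
    {"gender": "M", "marital": "married", "education": "PhD"},
]
feature_distances = {"gender": mismatch, "marital": mismatch, "education": mismatch}

dist = pairwise_distances(records, feature_distances)
labels, medoids = k_medoids(dist, initial_medoids=[0, 2])
print(labels)  # [0, 0, 1, 1]
```

Any entry of `feature_distances` can be swapped for a hand-set matrix lookup where one is available; the clustering step only ever sees the resulting pairwise matrix.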
