3

I have a dataset with categorical features. I want to segment the data using clustering techniques. What could be the possible choices for this scenario, given that the data has categorical features? Is there any variation of k-means which can be used here?

user3198880
  • 39
  • 1
  • 1
  • 2

4 Answers

5

k-means is not a good choice, because it is designed for continuous variables. Its objective is least squares: a deviation of 2.0 counts 4x as bad as a deviation of 1.0.

On binary data (such as one-hot encoded categorical data), this notion of squared deviations is not very appropriate. In particular, the cluster centroids are not binary vectors anymore!

The question you should ask first is: "what is a cluster?". Don't just hope an algorithm works. Choose (or build!) an algorithm that solves your problem, not someone else's!

On categorical data, frequent itemsets are usually the much better concept of a cluster than the centroid concept of k-means.
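To make the frequent-itemset idea concrete, here is a small stdlib-only sketch that counts itemsets of size 1 and 2 over categorical records; the records and the support threshold are made up for illustration, and a real analysis would use a proper implementation such as mlxtend's apriori:

```python
# Minimal frequent-itemset counting on categorical records,
# illustrating the "cluster = frequent pattern" notion.
from itertools import combinations
from collections import Counter

# Hypothetical records: each row is a set of feature=value items.
records = [
    {"color=red", "size=S", "type=A"},
    {"color=red", "size=S", "type=B"},
    {"color=blue", "size=M", "type=A"},
    {"color=red", "size=S", "type=A"},
]

min_support = 3  # itemset must appear in at least 3 records

counts = Counter()
for rec in records:
    for k in (1, 2):                       # itemsets of size 1 and 2
        for itemset in combinations(sorted(rec), k):
            counts[itemset] += 1

frequent = {s: c for s, c in counts.items() if c >= min_support}
print(frequent)
```

Here the pattern {color=red, size=S} appears in 3 of 4 records and would describe a "cluster" directly in terms of categorical values, which a k-means centroid cannot.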

Has QUIT--Anony-Mousse
  • 8,134
  • 1
  • 16
  • 31
1

Not enough reputation to comment...

Do you have any insight on whether your categorical variables exhibit some ordering? Or are they nominal? Is it possible to impose an ordering on your variables such that it is intuitive?

Your problem comes down to choosing an appropriate distance metric. Or rather, what defines 'similarity' to you. There is a variant of the k-means algorithm called k-modes that you may want to explore. The last link below provides more information on this categorical clustering method.

In the absence of more information about your data, these links might be useful:

https://stats.stackexchange.com/questions/56479/cluster-analysis-on-ordinal-data-likert-scale

https://stats.stackexchange.com/questions/28170/clustering-a-dataset-with-both-discrete-and-continuous-variables

K-Means clustering for mixed numeric and categorical data
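To illustrate the k-modes idea mentioned above, here is a stdlib-only sketch (not Huang's full algorithm): dissimilarity is the number of mismatched attributes, and each cluster centre is the per-attribute mode. The data and parameters are invented; the `kmodes` Python package provides a production implementation.

```python
# A toy k-modes: simple-matching distance + per-attribute modes.
from collections import Counter
import random

def mismatches(a, b):
    """Simple matching dissimilarity between two categorical rows."""
    return sum(x != y for x, y in zip(a, b))

def column_modes(rows):
    """Per-attribute mode of a list of rows (the cluster 'centre')."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))

def k_modes(data, k, iters=10, seed=0):
    rng = random.Random(seed)
    modes = rng.sample(data, k)            # initialise modes from the data
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for row in data:                   # assign to nearest mode
            i = min(range(k), key=lambda j: mismatches(row, modes[j]))
            clusters[i].append(row)
        modes = [column_modes(c) if c else modes[i]   # update modes
                 for i, c in enumerate(clusters)]
    return modes, clusters

data = [("a", "x"), ("a", "x"), ("b", "y"), ("b", "y"), ("b", "x")]
modes, clusters = k_modes(data, k=2)
```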

dmanuge
  • 146
  • 2
0

I don't really see a reason why simple K-Means clustering shouldn't work. If you convert your categorical data into integers (or encode it to binary, where one column corresponds to one category, so-called "one-hot encoding"), you can then feed it into the algorithm.

Then you can compare the clusters with each other by, let's say, calculating the mode of each feature to see the differences.

Also, as dmanuge mentioned, playing with different metrics can be helpful. But I'd try that only after the simple K-Means.
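For the "compare clusters by their modes" step, here is a small stdlib-only sketch; the rows and cluster labels are invented placeholders for whatever labels your K-Means run (e.g. scikit-learn's `KMeans` on one-hot data) would produce:

```python
# Summarise each cluster by the per-feature mode of its members.
from collections import Counter

rows = [("red", "S"), ("red", "S"), ("blue", "M"), ("blue", "L")]
labels = [0, 0, 1, 1]   # hypothetical cluster assignments

def cluster_modes(rows, labels):
    summary = {}
    for lab in set(labels):
        members = [r for r, l in zip(rows, labels) if l == lab]
        summary[lab] = tuple(Counter(col).most_common(1)[0][0]
                             for col in zip(*members))
    return summary

summary = cluster_modes(rows, labels)
print(summary)
```

This gives one representative categorical value per feature per cluster, which is much easier to interpret than the fractional centroids K-Means produces on one-hot data.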

HonzaB
  • 1,699
  • 1
  • 14
  • 20
0

Your approach may depend on the number of features and the number of categories in each feature that you are trying to include in your model. I've used dummy variables to convert categorical data into numerical data and then used the dummy variables to do K-means clustering with some success.

Here's a small example:

+----+----+----+
| ID | F1 | F2 |
+----+----+----+
|  1 | a  | x  |
|  2 | d  | w  |
|  3 | f  | x  |
+----+----+----+

Create a column for each category of each feature. For each record, the value of the dummy variable field is 1 only in the dummy variable field that corresponds to the initial feature value. The rest are 0.

+----+------+------+------+------+------+
| ID | F1_a | F1_d | F1_f | F2_w | F2_x |
+----+------+------+------+------+------+
|  1 |    1 |    0 |    0 |    0 |    1 |
|  2 |    0 |    1 |    0 |    1 |    0 |
|  3 |    0 |    0 |    1 |    0 |    1 |
+----+------+------+------+------+------+

If you're working with Pandas in Python, pandas.get_dummies() can generate the dummy variables for you.
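For the toy table above, the call looks like this; column names follow pandas' `<feature>_<category>` convention, matching the dummy table shown, and `dtype=int` forces 0/1 values instead of booleans:

```python
# Reproduce the dummy-variable table with pandas.get_dummies.
import pandas as pd

df = pd.DataFrame({"ID": [1, 2, 3],
                   "F1": ["a", "d", "f"],
                   "F2": ["x", "w", "x"]})

dummies = pd.get_dummies(df, columns=["F1", "F2"], dtype=int)
print(dummies)
```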

Sometimes you have so many categories that it would be unreasonable to create a dummy variable for each one. For my problem, it was acceptable to include dummy variables only for the categories that occurred most frequently (e.g. the top 15), but you'll have to decide whether that's appropriate for your problem.
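One way to sketch that "top categories only" trick with pandas: keep dummies for the most frequent categories and lump the rest into a catch-all bucket. The series, cut-off, and the "other" label are illustrative choices, not part of the original answer.

```python
# Keep dummies only for the top-N categories of a feature.
import pandas as pd

s = pd.Series(["a", "a", "a", "b", "b", "c", "d"], name="F1")

top = s.value_counts().nlargest(2).index          # top-2 categories
reduced = s.where(s.isin(top), other="other")     # lump the long tail
dummies = pd.get_dummies(reduced, prefix="F1", dtype=int)
print(dummies.columns.tolist())
```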

Andrew
  • 256
  • 2
  • 4