Why is automatically labeling data in unsupervised learning hard?

Question

I am currently studying machine learning and pattern recognition. Today, my professor said that implementing an unsupervised system that automatically labels data is difficult. Why is that?

I think that if I am given a data set, I can categorize all the data into groups. Then, for unknown input data, I can extract features and put it into the group that best fits those features. Can anyone explain what's wrong with my intuition and why this is a difficult problem?

score 6 · Answer 1 · edited Apr 13 '17 at 12:48

The problem with your intuition is that, frankly, you don't have one. At least not a useful one, that is, an algorithmic one that tells you what is hard for computers and what is not. You think in terms of what you can do with everyday data -- but that's not an appropriate frame of reference for this problem. As evidence, consider this question.

Some specific concerns:

"I can" != "I can build an algorithm which can".
Can you? Look at some data dumps from CERN and try to classify them.
What exactly are "features" (mathematically)?
What does "fits best" mean (mathematically)?
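
To make the last point concrete, here is a minimal sketch, not a definitive implementation, showing that "fits best" has no single meaning until a distance is chosen. The group centroids, the query point, and the helper names (`euclidean`, `manhattan`, `best_fit`) are all made up for illustration.

```python
import math

def euclidean(p, q):
    # Straight-line distance between two points.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    # Sum of absolute coordinate differences.
    return sum(abs(a - b) for a, b in zip(p, q))

def best_fit(point, centroids, distance):
    # Return the name of the group whose centroid is closest under `distance`.
    return min(centroids, key=lambda name: distance(point, centroids[name]))

# Hypothetical group centers and a new, unlabeled point.
centroids = {"A": (0.0, 0.0), "B": (8.0, 3.0)}
point = (3.0, 3.0)

print(best_fit(point, centroids, euclidean))  # A
print(best_fit(point, centroids, manhattan))  # B
```

The same point lands in a different group depending on the metric, and nothing in the data says which metric is the right one. That choice has to come from you.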

I recommend three things that will help you build a (better) intuition.

Study computability and complexity theory -- this will help you build intuition for the hardness of computational problems.
Program things -- this will help you build intuition for the hardness of implementing things.
Try to build an unsupervised learner and fall into all the pitfalls.

score 0 · Answer 2 · answered Jun 12 '14 at 15:21

Implementing any nontrivial machine learning system is difficult! But there are different levels of difficulty (a continuum or hierarchy, so to speak). The professor is contrasting the problem with another problem. Consider these two problems. Image classification is a common, challenging problem at the edge of ML feasibility, but of course not the only kind of classification problem; the example is based on it.

A. Given a set of images, find the distinct objects that appear in them, i.e. the result is a set of arbitrary choices about which category each image is in. No information is given about the categories at all, not even how many categories there are.
B. Given a set of images and a finite set of labels ("categories"), classify the images based on the labels. One has training data in which the images are correctly classified. Classify the images in the test set by choosing one of the finite known labels.

Clearly, in option B there is more data to work with and the task is less arbitrary, so the ML algorithm can potentially be more successful. In option A, by contrast, there are infinitely many possible labels.
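
The contrast can be sketched on toy one-dimensional data instead of images. This is only an illustration under made-up assumptions: `kmeans_1d` and `nearest_class_mean` are hypothetical helpers, the numbers are invented, and k-means is just one stand-in for an unsupervised grouping method.

```python
def kmeans_1d(values, k, iterations=20):
    """Tiny 1-D k-means with a deterministic start (assumes k >= 2)."""
    ordered = sorted(values)
    # Start from k evenly spaced points of the sorted data.
    centroids = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]
    for _ in range(iterations):
        # Assign every value to its nearest centroid.
        labels = [min(range(k), key=lambda j: abs(v - centroids[j])) for v in values]
        # Move each centroid to the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            if members:
                centroids[j] = sum(members) / len(members)
    return labels

def nearest_class_mean(train, x):
    """Supervised case: pick the known label whose training mean is closest."""
    means = {label: sum(xs) / len(xs) for label, xs in train.items()}
    return min(means, key=lambda label: abs(x - means[label]))

data = [1.0, 1.5, 2.0, 5.0, 5.5, 6.0, 11.0, 11.5, 12.0]

# Problem A: nobody says how many groups exist, so k is a guess.
print(kmeans_1d(data, 2))  # [0, 0, 0, 0, 0, 0, 1, 1, 1]
print(kmeans_1d(data, 3))  # [0, 0, 0, 1, 1, 1, 2, 2, 2]

# Problem B: the label set and correctly labeled examples are given.
train = {"low": [1.0, 1.5, 2.0, 5.0, 5.5, 6.0], "high": [11.0, 11.5, 12.0]}
print(nearest_class_mean(train, 7.0))  # low
```

Both groupings in the unsupervised case are internally consistent, and the data alone does not say which one is "correct". In the supervised case the training labels settle that question in advance.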

This basic difference in difficulty is why the recent Google results, which found image classifications without labels in the training data, are considered such a dramatic milestone and breakthrough in ML that even the tabloid headlines were not necessarily overhyped!
