
Obviously it always depends on the specific case, but my question is: how can I label data efficiently without writing, from scratch, source code that essentially solves the final problem itself?

For example, suppose that I want to use the $k$-Nearest Neighbours algorithm to classify tables from images according to their shape (square tables, rectangular tables, round tables).

For this purpose, I would need to find, say, at least 500 images of various tables and then label them.

However, if I do not want to label them manually, then I must write an extensive program to do it, and that program is essentially the one I wanted to build in the first place.

In other words, if I ultimately want to use machine learning, there is no point in writing from scratch an algorithm that accurately labels the data, because that algorithm would already solve my overall problem without machine learning. And such an algorithm would clearly be very difficult to write if it had to cover all possible cases, e.g. tables in very different kinds of images.

So how can I label data quickly without writing so much code?

Outcast

4 Answers


You can't. You can't squeeze blood out of a stone. There's no such thing as a free lunch. You can't get something from nothing. If you want labelled images, you will need to label them yourself, or find some existing data set that comes with labels. There are no shortcuts.

Yes, this is tedious and labor-intensive. This is one of the less-well-known and less-glamorous aspects of working on machine learning: in practical projects, we often spend the majority of our time (or more!) just assembling data sets, and only a small fraction on the actual learning algorithms themselves.

I know you could try various shortcuts, like writing a quick program to label them. But if that program makes mistakes, you'll just be training your machine learning algorithm to make the same mistakes, so that's not actually helpful.

There are ways to reduce the amount of labelling effort, but they will still require a significant amount of manual labelling; nothing comes for free. For instance, you can use active learning algorithms to identify which instances to label. A simple example: manually label a few hundred images as your initial training set, train a classifier, apply it to all remaining images, pick out the 20 images the classifier is least confident about, manually label those 20 images, add them to the training set, and repeat. (This is an application of uncertainty sampling; a minimal sketch follows below.) There are other, more sophisticated methods out there as well.
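For illustration, here is a minimal sketch of such an uncertainty-sampling loop, assuming the images have already been converted to feature vectors and that `label_manually` is a hypothetical placeholder for the human labelling step:

```python
# Sketch of uncertainty sampling with a k-NN classifier (scikit-learn).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def uncertainty_sampling_loop(X_labelled, y_labelled, X_unlabelled,
                              label_manually, rounds=5, batch_size=20):
    clf = KNeighborsClassifier(n_neighbors=5)
    for _ in range(rounds):
        clf.fit(X_labelled, y_labelled)
        # Confidence = probability of the most likely class for each image.
        proba = clf.predict_proba(X_unlabelled)
        confidence = proba.max(axis=1)
        # Pick the images the classifier is least sure about.
        query_idx = np.argsort(confidence)[:batch_size]
        new_labels = label_manually(X_unlabelled[query_idx])  # human in the loop
        # Move the newly labelled images into the training set.
        X_labelled = np.vstack([X_labelled, X_unlabelled[query_idx]])
        y_labelled = np.concatenate([y_labelled, new_labels])
        X_unlabelled = np.delete(X_unlabelled, query_idx, axis=0)
    return clf, X_labelled, y_labelled
```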

Another plausible approach is to somehow cluster the images, then manually label a few images from each cluster. This has issues, though, as you'll need some reasonable way to cluster the images, and that might be a non-trivial task. (One possible approach for clustering is to take some existing, pre-trained ImageNet classifier -- e.g., VGG, Inception, ResNet, etc. -- throw away the last layer or last two layers, and use the output before those layers as the input to some clustering algorithm like k-means. Doing k-means clustering directly on the raw image probably won't work well, but if you do it on the vector of activation values at some deep layer near the end of a good pre-trained classifier, then you might get better results.)
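As a rough sketch of that idea, assuming TensorFlow/Keras and scikit-learn are available and that `images` is an array of RGB images already resized to 224x224:

```python
# Sketch: cluster images on deep features from a pre-trained network.
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from sklearn.cluster import KMeans

def cluster_with_pretrained_features(images, n_clusters=3):
    # Pre-trained VGG16 without its classification head; global average
    # pooling turns each image into a 512-dimensional feature vector.
    feature_extractor = VGG16(weights="imagenet", include_top=False, pooling="avg")
    features = feature_extractor.predict(preprocess_input(images.astype("float32")))
    # k-means on the deep features rather than on raw pixels.
    kmeans = KMeans(n_clusters=n_clusters, n_init=10)
    cluster_ids = kmeans.fit_predict(features)
    return cluster_ids  # then manually label a few images from each cluster
```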

Finally, we often use data augmentation to make the classifier more robust. For instance, for each image in the training set, we might generate 10 more copies, where each copy is rotated or translated or cropped by a random amount, and add that to the training set. This tends to help make the classifier more robust to changes in orientation or pose -- but it doesn't give you something for nothing. You still need a large training set that contains many different kinds of tables.
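A minimal sketch of that kind of augmentation, using Keras's `ImageDataGenerator` (the specific perturbation parameters here are just illustrative), assuming `images` has shape (n, height, width, 3) and `labels` has shape (n,):

```python
# Sketch: generate several randomly perturbed copies of each training image.
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

def augment(images, labels, copies=10):
    generator = ImageDataGenerator(rotation_range=20,
                                   width_shift_range=0.1,
                                   height_shift_range=0.1,
                                   zoom_range=0.1,
                                   horizontal_flip=True)
    augmented_images, augmented_labels = [], []
    for image, label in zip(images, labels):
        batch = image[np.newaxis]  # the generator expects a batch dimension
        flow = generator.flow(batch, batch_size=1)
        for _ in range(copies):
            augmented_images.append(next(flow)[0])
            augmented_labels.append(label)
    return np.array(augmented_images), np.array(augmented_labels)
```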

D.W.

There exist non-machine-learning approaches to feature detection in images, but those are generally complicated and not very successful compared with machine learning methods.

Labeling the data somehow, without involving the algorithm to be trained, is a hard requirement of supervised learning. Since most decent automated labeling methods would themselves use machine learning, this looks like a bootstrapping issue.

There also exist unsupervised learning methods, e.g. clustering approaches such as $k$-means, but most of these are unsuitable for very complex classification tasks such as image recognition. (Sure, Google did a reasonably large unsupervised image classification experiment a few years ago, but it required computational resources that aren't available to most of us.)

In the end, image recognition and other hard machine learning tasks are approached with ML methods precisely because humans are much better at them than computers, so we hope to 'teach' the computer to make decisions similar to a human's. This means that, unfortunately, human effort is required to train these algorithms.

Discrete lizard

What you are looking for is very similar to a clustering task. So you should look into clustering algorithms for images (in the field of unsupervised learning); for example, the JULE algorithm (Joint Unsupervised Learning of Deep Representations and Image Clusters, Yang et al., 2016) does that sort of thing.

David Taub

It is standard practice for a human to label the data in the beginning. You can assist the human with tools, but there is always a human component in creating a new dataset.

Interesting things to mention:

  • Amazon Mechanical Turk
  • Online learning (see the sketch below)
  • Active learning
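As a small illustration of the online-learning point, here is a sketch using scikit-learn's `SGDClassifier`, assuming newly labelled batches of feature vectors arrive over time (e.g. from Mechanical Turk or an active-learning loop), so the model is updated incrementally instead of being retrained from scratch:

```python
# Sketch: incremental (online) learning with scikit-learn's partial_fit.
import numpy as np
from sklearn.linear_model import SGDClassifier

classes = np.array([0, 1, 2])  # e.g. square, rectangular, round tables
model = SGDClassifier()

def update_with_new_batch(X_batch, y_batch):
    # The full set of classes must be supplied on the first call to partial_fit;
    # passing it on every call is also allowed.
    model.partial_fit(X_batch, y_batch, classes=classes)
```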
Martin Thoma