
Suppose I'm performing machine learning on a simple dataset, and have a bunch of training data of the form:

x (feature)   y (label)
-----------------------
1             0
2             1
3             1
4             0
5             1
6             1
...

Here the labels take values in one of two classes, $\{0, 1\}$. Clearly, this training data leads one to believe that this is a classification task.

However, suppose that instead I want to output the probability that an example belongs to class $1$. Then the output looks more like a regression task.

Consequently, when designing a simple neural network with just a single input layer and a single output layer, how many output units should I have? Should I have two output units, one for each class? If so, how do I ensure that each pair of outputs forms a valid probability distribution (i.e., sums to one)? Or should I have only one output unit and treat the entire problem as a regression task?

There are probably pros/cons to each approach... thanks for your help!

manlio
sir_thursday

1 Answer


This is still a binary classification task. In the abstract, there are two ways to handle this:

  • Most classifiers can output a predicted class together with a confidence score that indicates how confident the classifier is in its prediction. If you don't need a probability, you can use the confidence score directly. If you do, you can apply a calibration procedure (such as Platt scaling) that maps the score to a probability in the range $[0, 1]$; a sketch of one such procedure follows this list.

  • Some classifiers can output a probability directly; generative models are particularly good at this. Logistic regression, for instance, outputs a probability for class $1$, from which the predicted class follows by thresholding.
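
As an illustration of the calibration procedure mentioned in the first bullet, here is a minimal sketch using scikit-learn (an assumption on my part; any calibration library would do). A LinearSVC only exposes confidence scores, and CalibratedClassifierCV fits a sigmoid on them (Platt scaling); the toy data mirrors the table in the question.

    import numpy as np
    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.svm import LinearSVC

    # Toy data mirroring the table in the question.
    X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
    y = np.array([0, 1, 1, 0, 1, 1])

    # LinearSVC only provides a confidence score (decision_function).
    # The wrapper fits a sigmoid on that score (Platt scaling), so
    # predict_proba returns calibrated probabilities in [0, 1].
    clf = CalibratedClassifierCV(LinearSVC(), method="sigmoid", cv=2)
    clf.fit(X, y)
    print(clf.predict_proba(np.array([[2.5]])))  # [[P(y=0), P(y=1)]]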

If you construct a neural network with $k$ outputs, one per class, using a softmax output, and train it to minimize the cross-entropy loss function, then you can interpret the output as a probability distribution over the classes (though beware that it might be biased or over-confident). The softmax ensures the outputs lie in the range $[0, 1]$ and sum to one.
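
For concreteness, here is a minimal sketch of that setup (assuming PyTorch; any framework works), with $k = 2$ outputs on the toy data above. Note that PyTorch's nn.CrossEntropyLoss expects raw logits and applies the log-softmax internally, so the explicit softmax is only needed at prediction time.

    import torch
    import torch.nn as nn

    # Toy data mirroring the table in the question.
    X = torch.tensor([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
    y = torch.tensor([0, 1, 1, 0, 1, 1])

    model = nn.Linear(1, 2)          # one input feature -> two output units (logits)
    loss_fn = nn.CrossEntropyLoss()  # applies log-softmax to the logits internally
    opt = torch.optim.SGD(model.parameters(), lr=0.1)

    for _ in range(500):
        opt.zero_grad()
        loss_fn(model(X), y).backward()
        opt.step()

    # At prediction time, softmax turns the two logits into probabilities
    # that each lie in [0, 1] and sum to one.
    probs = torch.softmax(model(torch.tensor([[2.5]])), dim=1)
    print(probs)  # tensor of the form [[P(y=0), P(y=1)]]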

In your case, you can try both approaches: two outputs with a softmax, or a single output with a sigmoidal activation function. The only way to know which performs better on your data is to try both, but personally I'd lean towards two outputs and a softmax. The softmax variant is sketched above; the single-output variant is sketched below.
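
For comparison, the single-output variant (again a sketch, assuming PyTorch) uses one unit with a sigmoid and binary cross-entropy; the lone output is read directly as $P(y = 1)$.

    import torch
    import torch.nn as nn

    X = torch.tensor([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
    y = torch.tensor([[0.0], [1.0], [1.0], [0.0], [1.0], [1.0]])  # float targets

    model = nn.Linear(1, 1)           # a single output unit (one logit)
    loss_fn = nn.BCEWithLogitsLoss()  # sigmoid + binary cross-entropy, fused for stability
    opt = torch.optim.SGD(model.parameters(), lr=0.1)

    for _ in range(500):
        opt.zero_grad()
        loss_fn(model(X), y).backward()
        opt.step()

    # The sigmoid maps the logit to a value in [0, 1], read as P(y = 1);
    # P(y = 0) is simply its complement.
    print(torch.sigmoid(model(torch.tensor([[2.5]]))))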

D.W.