
Suppose I'm performing machine learning on a simple dataset, and have a bunch of training data of the form:

x (feature)   y (label)
-----------------------
1             0
2             1
3             1
4             0
5             1
6             1
...

Here the labels take values in one of two classes, $\{0, 1\}$. Clearly, this training data leads one to believe that this is a classification task.

However, suppose that instead I want to output the probability that an example belongs to class $1$. Then the output looks more like a regression task.

Consequently, when designing a simple neural network with just a single input layer and a single output layer, how many output units should I have? Should I have two output units, one for each class? If so, how do I ensure that each pair of outputs forms a valid probability distribution (i.e., sums to one)? Or should I have only one output unit and treat the entire problem as a regression task?

There are probably pros/cons to each approach... thanks for your help!

manlio
sir_thursday

1 Answer


This is still a binary classification task. In the abstract, there are two ways to handle this:

  • Most classifiers can output a predicted class together with a confidence score that indicates how confident the classifier is in its prediction. If you don't need a probability, you can use the confidence score directly. If you do, you can apply a calibration procedure (such as Platt scaling) that maps the score to a probability in the range $[0, 1]$; a sketch of one such procedure follows this list.

  • Some classifiers can output a probability directly; generative models are particularly good at this. Logistic regression, for instance, outputs a probability for class $1$, from which the predicted class follows by thresholding.
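
As an illustration of the calibration procedure mentioned in the first bullet, here is a minimal sketch using scikit-learn (an assumption on my part; any calibration library would do). A LinearSVC only exposes confidence scores, and CalibratedClassifierCV fits a sigmoid on them (Platt scaling); the toy data mirrors the table in the question.

    import numpy as np
    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.svm import LinearSVC

    # Toy data mirroring the table in the question.
    X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
    y = np.array([0, 1, 1, 0, 1, 1])

    # LinearSVC only provides a confidence score (decision_function).
    # The wrapper fits a sigmoid on that score (Platt scaling), so
    # predict_proba returns calibrated probabilities in [0, 1].
    clf = CalibratedClassifierCV(LinearSVC(), method="sigmoid", cv=2)
    clf.fit(X, y)
    print(clf.predict_proba(np.array([[2.5]])))  # [[P(y=0), P(y=1)]]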

If you construct a neural network with $k$ outputs, one per class, using a softmax output, and train it to minimize the cross-entropy loss function, then you can interpret the output as a probability distribution over the classes (though beware that it might be biased or over-confident). The softmax ensures the outputs lie in the range $[0, 1]$ and sum to one.
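
For concreteness, here is a minimal sketch of that setup (assuming PyTorch; any framework works), with $k = 2$ outputs on the toy data above. Note that PyTorch's nn.CrossEntropyLoss expects raw logits and applies the log-softmax internally, so the explicit softmax is only needed at prediction time.

    import torch
    import torch.nn as nn

    # Toy data mirroring the table in the question.
    X = torch.tensor([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
    y = torch.tensor([0, 1, 1, 0, 1, 1])

    model = nn.Linear(1, 2)          # one input feature -> two output units (logits)
    loss_fn = nn.CrossEntropyLoss()  # applies log-softmax to the logits internally
    opt = torch.optim.SGD(model.parameters(), lr=0.1)

    for _ in range(500):
        opt.zero_grad()
        loss_fn(model(X), y).backward()
        opt.step()

    # At prediction time, softmax turns the two logits into probabilities
    # that each lie in [0, 1] and sum to one.
    probs = torch.softmax(model(torch.tensor([[2.5]])), dim=1)
    print(probs)  # tensor of the form [[P(y=0), P(y=1)]]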

In your case, you can try both approaches: two outputs with a softmax, or a single output with a sigmoidal activation function. The only way to know which performs better on your data is to try both, but personally I'd lean towards two outputs and a softmax. The softmax variant is sketched above; the single-output variant is sketched below.
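
For comparison, the single-output variant (again a sketch, assuming PyTorch) uses one unit with a sigmoid and binary cross-entropy; the lone output is read directly as $P(y = 1)$.

    import torch
    import torch.nn as nn

    X = torch.tensor([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
    y = torch.tensor([[0.0], [1.0], [1.0], [0.0], [1.0], [1.0]])  # float targets

    model = nn.Linear(1, 1)           # a single output unit (one logit)
    loss_fn = nn.BCEWithLogitsLoss()  # sigmoid + binary cross-entropy, fused for stability
    opt = torch.optim.SGD(model.parameters(), lr=0.1)

    for _ in range(500):
        opt.zero_grad()
        loss_fn(model(X), y).backward()
        opt.step()

    # The sigmoid maps the logit to a value in [0, 1], read as P(y = 1);
    # P(y = 0) is simply its complement.
    print(torch.sigmoid(model(torch.tensor([[2.5]]))))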

D.W.