
I've understood that SVMs are binary, linear classifiers (without the kernel trick). The training data consists of pairs $(x_i, y_i)$, where $x_i$ is a feature vector and $y_i \in \{-1, 1\}$ is the class label. As they are binary, linear classifiers, the task is to find a hyperplane which separates the data points with label $-1$ from the data points with label $+1$.

Assume for now that the data points are linearly separable and we don't need slack variables.

I've read that the training problem is then the following optimization problem:

  • ${\min_{w, b} \frac{1}{2} \|w\|^2}$
  • s.t. $y_i ( \langle w, x_i \rangle + b) \geq 1$
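To make this concrete for myself, here is how I would feed the problem to a generic QP solver; the toy data and the use of cvxpy are just my own illustration:

```python
import numpy as np
import cvxpy as cp

# made-up, linearly separable 2-D toy data
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -2.0], [-1.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()

# minimize 1/2 ||w||^2  subject to  y_i (<w, x_i> + b) >= 1
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print("w =", w.value, "b =", b.value)
```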

I think I got that minimizing $\|w\|^2$ means maximizing the margin (however, I don't understand why it is squared here. Would anything change if one tried to minimize $\|w\|$ instead?).

I also understood that $y_i ( \langle w, x_i \rangle + b) \geq 0$ means that the model has to be correct on the training data. However, there is a $1$ and not a $0$. Why?

Martin Thoma

1 Answer

First problem: Minimizing $\|w\|$ or $\|w\|^2$:

It is correct that one wants to maximize the margin. This is actually done by maximizing $\frac{2}{\|w\|}$. This would be the "correct" way of doing it, but it is rather inconvenient. Let's first drop the $2$, as it is just a constant. Now if $\frac{1}{\|w\|}$ is maximal, $\|w\|$ will have to be as small as possible. We can thus find the identical solution by minimizing $\|w\|$.

$\|w\|$ can be calculated as $\sqrt{w^T w}$. As the square root is a monotonically increasing function, any $w$ which minimizes $\sqrt{w^T w}$ also minimizes $w^T w$. We thus don't have to calculate the square root at all and can minimize $w^T w = \|w\|^2$ directly.
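If it helps, here is a quick numerical sanity check of that monotonicity argument (the random candidate vectors are of course arbitrary): ranking candidates by $\|w\|$ and by $\|w\|^2$ gives the same order, so a constrained minimizer of one is also a minimizer of the other.

```python
import numpy as np

rng = np.random.default_rng(0)
candidates = rng.normal(size=(5, 3))        # five arbitrary candidate weight vectors
norms = np.linalg.norm(candidates, axis=1)  # ||w||
squares = (candidates ** 2).sum(axis=1)     # w^T w = ||w||^2

# both objectives order the candidates identically
print(np.array_equal(np.argsort(norms), np.argsort(squares)))  # True
```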

Finally, as we often have to calculate derivatives, we multiply the whole expression by a factor of $\frac{1}{2}$. This is done very often because $\frac{d}{dx} x^2 = 2 x$ and thus $\frac{d}{dx} \frac{1}{2} x^2 = x$, so the factor cancels the $2$ coming from the exponent. This is how we end up with the problem: minimize $\frac{1}{2} \|w\|^2$.
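As a throwaway finite-difference check of that factor (not part of any SVM library): the gradient of $\frac{1}{2}\|w\|^2$ is $w$ itself, while the gradient of $\|w\|^2$ is $2w$.

```python
import numpy as np

w = np.array([1.5, -2.0, 0.5])
eps = 1e-6

def grad(f, w):
    # central finite differences of f at w
    return np.array([(f(w + eps * e) - f(w - eps * e)) / (2 * eps)
                     for e in np.eye(len(w))])

print(grad(lambda v: 0.5 * v @ v, w))  # approximately  w
print(grad(lambda v: v @ v, w))        # approximately  2 * w
```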

tl;dr: yes, minimizing $\|w\|$ instead of $\frac{1}{2} \|w\|^2$ would work.

Second problem: $\geq 0$ or $\geq 1$:

As already stated in the question, $y_i \left( \langle w,x_i \rangle + b \right) \geq 0$ only means that the point has to be on the correct side of the hyperplane. However, this isn't enough: we want the point to be at least as far away as the margin (if it lies exactly on the margin, it is a support vector), or even further away.

Remember the definition of the hyperplane,

$\mathcal{H} = \{ x \mid \langle w,x \rangle + b = 0\}$.

This description, however, is not unique: if we scale $w$ and $b$ by a nonzero constant $c$, we get an equivalent description of the same hyperplane. Since the objective $\frac{1}{\|w\|}$ would change under such a rescaling, we remove the ambiguity by fixing the scale: we require that the points closest to the hyperplane, the support vectors, satisfy $y_i \left( \langle w,x_i \rangle + b \right) = 1$. With this normalization, the geometric distance of a support vector from the hyperplane, i.e. the margin, is $\frac{1}{\|w\|}$.
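For completeness, that margin value follows from the usual point-to-hyperplane distance formula: for any point $x_i$,

$$\operatorname{dist}(x_i, \mathcal{H}) = \frac{\left| \langle w, x_i \rangle + b \right|}{\|w\|},$$

and for a support vector the numerator is exactly $1$, so its distance to the hyperplane is $\frac{1}{\|w\|}$.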

As already mentioned earlier, we want all points to be either a support vector, or even further away from the hyperplane. In training, we thus add the constraint $y_i \left( \langle w,x_i \rangle + b \right) \geq 1$, which ensures exactly that.

tl;dr: Training points don't just need to be classified correctly; they have to lie on the margin or further away.
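As a tiny illustration (both the data and the hand-computed solution are made up for this example): for $x_1 = (1,1)$, $y_1 = +1$, $x_2 = (-1,-1)$, $y_2 = -1$ and $x_3 = (2,2)$, $y_3 = +1$, the hard-margin optimum is $w = (0.5, 0.5)$, $b = 0$. The snippet below just evaluates the constraints:

```python
import numpy as np

X = np.array([[1.0, 1.0], [-1.0, -1.0], [2.0, 2.0]])
y = np.array([1.0, -1.0, 1.0])

# hand-computed hard-margin optimum for this toy set
w = np.array([0.5, 0.5])
b = 0.0

margins = y * (X @ w + b)
print(margins)                   # [1. 1. 2.]  -> all constraints y_i(<w,x_i>+b) >= 1 hold
print(np.isclose(margins, 1.0))  # [ True  True False] -> x_1 and x_2 are support vectors
print(1 / np.linalg.norm(w))     # geometric margin = sqrt(2) ~ 1.41
```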

hbaderts