I was learning about support vector machines from MIT OpenCourseWare, and I have figured out most of it. I understand why we try to minimize $\frac{1}{2}\|w\|^2$. What I did not get is why we then try to maximize the Lagrange expression, as stated at 35:56 in the YouTube video.
The width of the margin is $2/\|w\|$. Since we want to maximize the width, we want to minimize $\|w\|$, which is equivalent to minimizing $\frac{1}{2}\|w\|^2$. Now we use the Lagrange expression, $$L = \frac{1}{2}\|w\|^2 - \sum_i \alpha_i\left(y_i(w \bullet x_i + b) - 1\right).$$ To find the minimum of $\frac{1}{2}\|w\|^2$, we set the gradient to zero, $\nabla L = 0$. From the $\nabla L = 0$ equation we get $\sum_i \alpha_i y_i = 0$ and $w = \sum_i \alpha_i y_i x_i$. Substituting these two equations back into the Lagrange expression, we end up with $$L(w,b,\alpha) = \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} y_i y_j \alpha_i \alpha_j (x_i \bullet x_j).$$
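For reference, here is the algebra behind that last substitution, plugging $w = \sum_i \alpha_i y_i x_i$ into $L$ and using $\sum_i \alpha_i y_i = 0$:
$$\begin{aligned}
L &= \frac{1}{2}\Big\|\sum_i \alpha_i y_i x_i\Big\|^2 - \sum_i \alpha_i\Big[y_i\Big(\sum_j \alpha_j y_j (x_j \bullet x_i) + b\Big) - 1\Big] \\
&= \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j (x_i \bullet x_j) - \sum_{i,j} \alpha_i \alpha_j y_i y_j (x_i \bullet x_j) - b\sum_i \alpha_i y_i + \sum_i \alpha_i \\
&= \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j (x_i \bullet x_j),
\end{aligned}$$
where the $b$ term vanishes because $\sum_i \alpha_i y_i = 0$.

I even tried a quick numerical sanity check (my own toy example, not from the lecture; I use `scipy.optimize.minimize` and maximize by minimizing the negative): maximizing this expression over $\alpha_i \geq 0$ subject to $\sum_i \alpha_i y_i = 0$ does recover the correct $w$, so maximizing clearly works, but I don't see why.

```python
import numpy as np
from scipy.optimize import minimize

# Toy separable data (my own example): one positive point, one negative point.
X = np.array([[1.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, -1.0])
K = X @ X.T  # Gram matrix of dot products x_i . x_j

def neg_dual(alpha):
    # Negative of L(alpha) = sum_i alpha_i - 1/2 sum_{i,j} y_i y_j alpha_i alpha_j (x_i . x_j);
    # scipy only minimizes, so maximizing L means minimizing -L.
    return -(alpha.sum() - 0.5 * (alpha * y) @ K @ (alpha * y))

res = minimize(neg_dual, x0=np.zeros(2), method="SLSQP",
               bounds=[(0, None)] * 2,                               # alpha_i >= 0
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])  # sum_i alpha_i y_i = 0

alpha = res.x
w = (alpha * y) @ X  # w = sum_i alpha_i y_i x_i
print(alpha, w)      # gives alpha = [0.5, 0.5], w = [1, 0], so width 2/||w|| = 2
```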
I understand everything up to here, but then the professor says that we want to maximize $L$, and I don't understand why. Why are we trying to maximize the Lagrange expression when I thought we were trying to minimize it?