
The LASSO problem can be formulated in two ways:

1) Constrained formulation: $$ \|Xw-y\|^2\to\min_{w}\\ \text{s.t. } \sum_{i}|w_i|\leq{t}. $$

2) Penalised formulation: $$ \|Xw-y\|^2+\lambda\left(\sum_{i}|w_i|\right)\to\min_w. $$

It is required to show that for any $t$ there exists a $\lambda$ such that the two formulations are equivalent.
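As a quick numerical sanity check (not a proof), one can solve the penalised problem for some $\lambda$, set $t=\|\hat w\|_1$ for the resulting minimiser $\hat w$, and confirm that the constrained problem with that $t$ recovers the same $\hat w$. Here is a minimal sketch, assuming cvxpy is available; the data $X$, $y$ and the value $\lambda=0.5$ are arbitrary illustrative choices:

```python
# Hypothetical sanity check; X, y and lam are arbitrary illustrative choices.
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))
y = rng.standard_normal(50)
lam = 0.5  # arbitrary penalty level

# Penalised formulation: ||Xw - y||^2 + lam * ||w||_1 -> min_w
w_pen = cp.Variable(10)
cp.Problem(cp.Minimize(cp.sum_squares(X @ w_pen - y)
                       + lam * cp.norm1(w_pen))).solve()

# Constraint level induced by the penalised solution
t = np.abs(w_pen.value).sum()

# The constrained formulation with that t should recover the same minimiser
w_con = cp.Variable(10)
cp.Problem(cp.Minimize(cp.sum_squares(X @ w_con - y)),
           [cp.norm1(w_con) <= t]).solve()

print(np.allclose(w_pen.value, w_con.value, atol=1e-4))  # True up to solver tolerance
```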

There is already such a question on this site here. However, I am not convinced that the accepted answer is correct: it argues that the KKT conditions for the first formulation are equivalent to the first-order conditions for the second one for certain parameter values. But, strictly speaking, the first-order conditions are not applicable here, since the objective function of the second formulation is not differentiable everywhere.

Is the provided proof valid? It seems to me that a better way to proceed here would be to use a saddle-point argument:

Consider the constrained formulation. Since the objective and the constraint are convex and Slater's condition is satisfied, the pair $(w^*, \lambda^*)$ is primal-dual optimal iff it is a saddle point of the Lagrangian: $$ \sup_{\lambda\geq 0}\inf_w L(w,\lambda)=\inf_{w}\sup_{\lambda\geq 0}L(w, \lambda)=L(w^*, \lambda^*),\\ L(w, \lambda)=\|Xw-y\|^2+\lambda\left(\sum_i|w_i|-t\right). $$ Moreover, we know that such a pair exists. Fixing $\lambda=\lambda^*$ in the saddle-point equality shows that $w^*$ minimises $L(w, \lambda^*)$ over $w$, i.e. the original problem is equivalent to $$ \|Xw-y\|^2+\lambda^*\left(\sum_i|w_i|-\underbrace{t}_{\text{const}}\right)\to\min_w, $$ and, since the constant $-\lambda^* t$ does not affect the minimiser, to $$ \|Xw-y\|^2+\lambda^*\sum_i|w_i|\to\min_w. $$ It is now obvious that, if in the penalised formulation we take $\lambda=\lambda^*$, the two problems are exactly identical.
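The correspondence can also be inspected numerically in the direction of this argument: solve the constrained problem, read off the optimal multiplier $\lambda^*$ as the dual variable of the $\ell_1$ constraint, and check that the penalised problem with $\lambda=\lambda^*$ has the same minimiser. Again a rough sketch assuming cvxpy, with arbitrary data and $t$:

```python
# Numerical version of the saddle-point argument; X, y and t are arbitrary.
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 10))
y = rng.standard_normal(50)
t = 0.5  # small enough that the l1 constraint is likely to bind

# Constrained formulation; keep a handle on the constraint to read its dual.
w_con = cp.Variable(10)
l1_cap = cp.norm1(w_con) <= t
cp.Problem(cp.Minimize(cp.sum_squares(X @ w_con - y)), [l1_cap]).solve()
lam_star = float(l1_cap.dual_value)  # optimal Lagrange multiplier lambda*

# Penalised formulation with lambda = lambda* should share the minimiser.
# (If the constraint did not bind, lambda* = 0 and both reduce to least squares.)
w_pen = cp.Variable(10)
cp.Problem(cp.Minimize(cp.sum_squares(X @ w_pen - y)
                       + lam_star * cp.norm1(w_pen))).solve()

print(lam_star, np.allclose(w_con.value, w_pen.value, atol=1e-4))
```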

Vossler
  • Are you sure you don't mean that for any $t$ there exists $\lambda$ such that they're equivalent? Because if you can choose any $t$, then it's pretty trivial... – Benjamin Lindqvist Mar 29 '16 at 11:10
  • And regardless, the KKT conditions still hold if you replace the condition that the gradient is zero with the condition that zero is contained in the subdifferential. I even think this was mentioned in the post you linked. – Benjamin Lindqvist Mar 29 '16 at 11:16
  • I reject the premise of the problem anyway. Yes, there are values of $t$, $\lambda$ for which the two produce an identical optimum. In fact there's an entire range of them. That doesn't mean the problems are equivalent in any useful sense, because in practice you're never going to know what precisely the "equivalence" relationship is between $\lambda$ and $t$ without actually solving the problems. That is to say: if you fix $\lambda$ and solve (2), you'll get a value of $t$---but there's no cheaper way to find that $t$ in general. – Michael Grant Mar 30 '16 at 00:51
  • Of course, I probably should have said that when I commented on the original problem ;-) – Michael Grant Mar 30 '16 at 00:54
  • @benjaminlindqvist Concerning your first comment: yes, that's what I meant, of course. Second: also true, I suppose - I didn't recall the subgradient generalisation at the time. Anyway, is the subgradient theory necessary here? It seems to me that in this case, since the problem is convex and the constraints satisfy Slater's condition, you can just substitute the dual solution into the saddle-point definition and get the equivalence straight away. – Vossler Mar 30 '16 at 22:05
  • @michaelgrant True, in practice there's no useful way of knowing the equivalence before solving; however, in practice you usually don't know the optimal value of the regulariser (or the constraint) anyway. What you usually do is try a range of parameters and choose the one that performs best on your data. What this result tells you is that it doesn't matter which formulation you take and which parameter you optimise: you won't miss any solutions, or acquire entirely different ones, by switching formulations. – Vossler Mar 30 '16 at 22:09
  • @michaelgrant However, I should probably mention that my practical considerations come from the data-analysis context; in other applications the situation may be different and what I said may not apply. – Vossler Mar 30 '16 at 22:12
  • I can't say whether your saddle-point approach is correct unless you offer it up in detail. I can't say it is obvious to me (as much of a fan as I am of duality arguments). – Michael Grant Mar 30 '16 at 22:53
  • @michaelgrant I have added what I had in mind to the post. – Vossler Mar 30 '16 at 23:25
  • I don't think subgradient theory is necessary, but such a useful and straightforward weapon should of course be in your arsenal anyway, and once it is, the proof is almost a one-liner. But I agree with you both that saddle points are more fun. – Benjamin Lindqvist Mar 31 '16 at 05:43

0 Answers