
I am currently attempting to derive the dual and KKT conditions for the following optimization problem:

\begin{equation} \min_{x\in \mathbb R^n} \quad x^TMx+a ||x||_1^2 \quad \text{subject to} \quad ||x||_2=1. \end{equation}

Regrettably, I haven't made any significant progress thus far. I am open to making assumptions on matrix $M$ and $a>0$ if it helps in deriving meaningful conditions. Any guidance or insights would be greatly appreciated. The case without $l_1$ norm is a well-studied problem (e.g., see 1 and 2).


1 Answer


To derive the dual and KKT conditions, I use $\|x\|^2_2=1$ instead of $\|x\|_2=1$ for computational simplicity. There is no restriction on $a$ and $M$ in the following, so the objective can be non-convex.

Also, recall the following facts:

$$x^TMx= x^T\left(\frac{M^T+M}{2} \right)x, \tag{1}$$

$$\|x\|_2 \le \|x\|_1 \le \sqrt {n} \|x\|_2 \tag{2}$$

and the notation $[a]^-=\min(a,0)$ and $[a]^+=\max(a,0).$

From (1), the problem can be written as

\begin{equation} \min_{x\in \mathbb R^n} \quad x^T\left(\frac{M^T+M}{2} \right)x+a ||x||_1^2 \quad \text{subject to} \quad ||x||^2_2=1. \end{equation}

Dual problem

From (2), we have

$$x^T\left(\frac{M^T+M}{2} \right)x+a \|x\|_1^2+\mu\|x\|^2_2 \ge \\ x^T \left (\frac{M^T+M}{2}+\left ([a]^- n+[a]^++\mu \right )I \right)x.$$

If $\frac{M^T+M}{2}+\left ([a]^- n+[a]^++\mu \right )I \succeq 0,$ the minimum of the RHS is zero, attained at $x=0$ on both sides. If $\frac{M^T+M}{2}+\left ([a]^- n+[a]^++\mu \right )I \nsucceq 0$, then for some $i$ there is a direction $+e_i$ or $-e_i$ along which the RHS tends to $-\infty$. Along such a direction the LHS and RHS coincide, so the LHS also tends to $-\infty$. Hence, we have

$$g(\mu)=\inf_{x\in \mathbb R^n}\bigg \{l(x,\mu)=x^T\left(\frac{M^T+M}{2} \right)x+a \|x\|_1^2+\mu(\|x\|^2_2-1) \bigg \}=\begin{cases} -\mu& \frac{M^T+M}{2}+ \left ([a]^- n+[a]^++\mu \right )I \succeq 0 \\ -\infty& \text{otherwise} \end{cases}$$

Thus, the dual of the problem is

\begin{equation} d^*=\max_{\mu\in \mathbb R} g(\mu) \equiv \color{blue}{\max_{\mu\in \mathbb R: \, \frac{M^T+M}{2}+\left ([a]^- n+[a]^++\mu \right )I \succeq 0 } \, -\mu}. \end{equation}

Since $A+cI \succeq 0$ if and only if $c+\lambda_{\min}(A)\ge 0$, we have

$$\frac{M^T+M}{2}+\left ([a]^- n+[a]^++\mu \right)I \succeq 0 \Leftrightarrow \\ [a]^- n+[a]^++\mu+\lambda_{\min}\left(\frac{M^T+M}{2} \right)\ge 0,$$ which yields $$d^*=[a]^- n+[a]^++\lambda_{\min}\left(\frac{M^T+M}{2} \right).$$
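As a quick numerical sanity check of this closed-form dual value (the matrix $M$, the value of $a$, and the sampling below are illustrative choices, not part of the question): weak duality requires $d^*$ to be at most the objective value at every feasible point.

```python
# Sanity check of d* = [a]^- n + [a]^+ + lambda_min((M^T + M)/2).
import numpy as np

rng = np.random.default_rng(0)
n = 5
M = rng.standard_normal((n, n))   # no symmetry or definiteness assumed
a = 0.7

S = (M + M.T) / 2
d_star = min(a, 0) * n + max(a, 0) + np.linalg.eigvalsh(S).min()

# Weak duality: d* <= p*, so d* must not exceed the objective at any
# feasible point (sampled uniformly on the unit sphere).
x = rng.standard_normal((10000, n))
x /= np.linalg.norm(x, axis=1, keepdims=True)
obj = np.einsum('ij,jk,ik->i', x, S, x) + a * np.abs(x).sum(axis=1) ** 2
print(d_star <= obj.min() + 1e-12)  # True
```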

By weak duality, $d^* \le p^*$, so this gives a lower bound on the primal optimal value $p^*$. The same bound is obtained in a different way below.

Bounds on the optimal value

Considering the known result (1 and 2) :

$$\lambda_{\min}(A)=\min_{x\in \mathbb R^n: \|x\|_2=1} \quad x^TAx $$

for any symmetric matrix $A$, and (2), we can directly bound $p^*$ as follows:

$$\color{blue}{\lambda_{\min}\left(\frac{M^T+M}{2} \right)+[a]^- n+[a]^+ \le p^* \le \lambda_{\min}\left(\frac{M^T+M}{2} \right) +[a]^+ n+[a]^-}.$$
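A small numerical illustration of this sandwich bound (again with an arbitrary $M$ and $a>0$ of my own choosing): the lower bound is checked against sampled feasible points, and the upper bound is certified by evaluating the objective at the minimizing eigenvector, which is itself feasible.

```python
# Check: lam_min + [a]^- n + [a]^+  <=  p*  <=  lam_min + [a]^+ n + [a]^-.
import numpy as np

rng = np.random.default_rng(1)
n = 4
M = rng.standard_normal((n, n))
a = 1.3

S = (M + M.T) / 2
evals, evecs = np.linalg.eigh(S)
lam_min = evals[0]
lower = lam_min + min(a, 0) * n + max(a, 0)
upper = lam_min + max(a, 0) * n + min(a, 0)

def obj(x):
    return x @ S @ x + a * np.abs(x).sum() ** 2

# Any feasible point upper-bounds p*; random unit vectors plus the
# minimizing eigenvector of S give an estimate p_hat >= p*.
xs = rng.standard_normal((20000, n))
xs /= np.linalg.norm(xs, axis=1, keepdims=True)
objs = np.einsum('ij,jk,ik->i', xs, S, xs) + a * np.abs(xs).sum(axis=1) ** 2
p_hat = min(objs.min(), obj(evecs[:, 0]))

print(lower <= p_hat and obj(evecs[:, 0]) <= upper)  # True
```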

The lower bound is the bound already obtained from the dual problem.

KKT conditions

The ordinary KKT conditions cannot be used directly because the objective is not differentiable. Working with the extended KKT conditions, in which subgradients are used, also seems difficult: the objective is not necessarily convex, so the subdifferential of the Lagrangian appearing in the stationarity condition cannot be determined easily (recall that in general $\sum_i \partial f_i \subseteq \partial \sum_i f_i$, with equality when all the functions $f_i$ are convex).

Here, I use a different trick to overcome the above obstacles to applying the KKT conditions to our problem. Note that our optimization problem can be decomposed into the following $2^n$ subproblems:

\begin{equation} \min_{x\in \mathbb R^n} \quad x^TMx+a \left (\sum_{i=1}^n b_ix_i \right)^2 \quad \text{subject to} \quad \|x\|^2_2=1, \; -b_ix_i \le 0, \; i=1,\dots, n, \end{equation}

defined for any constant vector $b \in \{-1,1 \}^n.$ The ordinary KKT conditions can be applied to each of these subproblems. At each feasible point, the linear independence constraint qualification holds because the gradients of all active constraints are linearly independent (note that at most $n-1$ of the constraints $-b_ix_i \le 0, i=1,\dots, n$ can be active at any feasible point, given $\|x\|^2_2=1$).

For each of the above subproblems, the KKT conditions yield

$$(M^T+M)x +2a \left (\sum_{i=1}^n b_ix_i \right) b -u_b \odot b + 2\mu_b x=0, \quad x \odot u_b=0 , $$

for $\mu_b \in \mathbb R, u_b \in \mathbb R^n_{\,\ge 0}$, where $\odot$ denotes the Hadamard product. For $-b$ instead of $b$, the above becomes

$$(M^T+M)x +2a \left (\sum_{i=1}^n b_ix_i \right) b + u_{-b} \odot b + 2\mu_{-b} x=0, \quad x \odot u_{-b}=0 , $$

for $\mu_{-b} \in \mathbb R, u_{-b} \in \mathbb R^n_{\,\ge 0}$. Thus, the two systems obtained for each pair of $b$ and $-b$ can be combined as follows:

$$(M^T+M)x +2a \left (\sum_{i=1}^n b_ix_i \right) b + u_{\mp b} \odot b + 2\mu_{\mp b} x=0, \, x \odot u_{\mp b} =0 , $$

for $\mu_{\mp b} \in \mathbb R, u_{\mp b} \in \mathbb R^n$. Now we only have $2^{n-1}$ cases. The above can be simplified further as

$$(M^T+M)x +2a \left (\sum_{i=1}^n b_ix_i \right) b + u_{\mp b} \odot (1- b \odot b)+ 2\mu_{\mp b} x=0 \tag{3}, $$

in which $u_{\mp b} \odot b$ is replaced with $u_{\mp b}$ (note that $u_{\mp b}$ is unrestricted in sign) and the complementary-slackness condition $x \odot u_{\mp b} =0$ is reflected in the term $u_{\mp b} \odot (1- b \odot b)$.

The system (3) needs to be examined for every pair of $b$ and $-b$. Hence, it can be encapsulated as follows:

$$(M^T+M)x +2a \|x\|_1 \text{sgn}(x)+ u \odot \left [1- \text{sgn}(x) \odot \text{sgn}(x) \right ]+ 2\mu x=0 , \tag{4} $$

for $\mu \in \mathbb R, u \in \mathbb R^n$.

In (4), if $x_i\neq 0$, then $u_i(1- \text{sgn}^2(x_i))=0$. If $x_i=0$, then $u_i(1- \text{sgn}^2(x_i))=u_i$, which makes the $i$th row of system (4) redundant: since $u_i \in \mathbb R$ is free, that row is satisfied for any $x$. Hence, we can equivalently write (4) as follows:

$$\color{blue}{\left [ (M^T+M)x +2a \|x\|_1 \text{sgn}(x) +2\mu x \right ] \odot \text{sgn}(x) =0}, \tag{5}$$

in which the $i$th row is automatically removed if $x_i=0.$ Whenever $M+M^T$ is a diagonal matrix or $a=0$, (5) can be simplified to

$$(M+M^T)x +2a \|x\|_1 \text{sgn}(x) +2\mu x =0. \tag{6}$$

Indeed, when $M+M^T$ is a diagonal matrix, if $x_i=0$, all elements of the $i$th row become $0$, so there is no need to multiply it with $\text{sgn}(x_i)$.

Note that the optimization problem always has a global minimizer because the feasible set is compact (bounded and closed) and the objective is a continuous function (Weierstrass extreme value theorem). On the other hand, the constraint qualification holds at any feasible point of each subproblem (thus the KKT conditions become necessary). Thus, the minimizer is among the solutions of the system (5).
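The claim that the global minimizer solves system (5) can be checked numerically on a small instance. The sketch below (my own illustrative setup: $n=2$, a random $M$, $a=0.5$, and a fine grid on the unit circle) recovers $\mu$ from one nonzero coordinate of (5) and verifies that the remaining rows are satisfied.

```python
# Verify numerically that the grid minimizer satisfies system (5).
import numpy as np

rng = np.random.default_rng(2)
M = rng.standard_normal((2, 2))     # illustrative, no structure assumed
a = 0.5
S = (M + M.T) / 2

# parametrize the feasible circle and locate the minimizer by grid search
t = np.linspace(0, 2 * np.pi, 2_000_001)
X = np.stack([np.cos(t), np.sin(t)], axis=1)
f = np.einsum('ij,jk,ik->i', X, S, X) + a * np.abs(X).sum(axis=1) ** 2
x = X[f.argmin()]

s = np.sign(x)
g = 2 * S @ x + 2 * a * np.abs(x).sum() * s   # gradient part of (5)
k = np.abs(x).argmax()
mu = -g[k] / (2 * x[k])                       # solve row k of (5) for mu
nz = np.abs(x) > 1e-4                         # rows with x_i = 0 drop out of (5)
residual = ((g + 2 * mu * x) * s)[nz]
print(np.allclose(residual, 0, atol=1e-3))    # True
```

The tolerance absorbs the grid-search error; rows whose $x_i$ vanishes (up to grid accuracy) are excluded, exactly as in (5).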

I evaluate systems (5) and (6) in four different cases below.

Simple case 1:

For $a=0$ and $M \neq 0$, we obtain

$$\left(\frac{M^T+M}{2} \right)x= -\mu x,$$

so each normalized eigenvector $x$ (with $\|x\|_2=1$) for the matrix $\frac{M^T+M}{2}$ is a feasible KKT point. Among them, the minimizer is the normalized eigenvector that corresponds to the minimum eigenvalue $\lambda_{\min}\left(\frac{M^T+M}{2} \right)=-\mu$.
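This case is easy to confirm numerically (the sketch uses a random illustrative $M$): the minimizing unit eigenvector satisfies the eigen-equation, and no sampled feasible point beats $\lambda_{\min}$.

```python
# Simple case 1 (a = 0): the minimizer is a unit eigenvector of (M^T+M)/2
# for the smallest eigenvalue.
import numpy as np

rng = np.random.default_rng(5)
n = 4
M = rng.standard_normal((n, n))
S = (M + M.T) / 2
evals, evecs = np.linalg.eigh(S)
v = evecs[:, 0]                          # unit eigenvector for lambda_min

assert np.allclose(S @ v, evals[0] * v)  # KKT point with mu = -lambda_min

# no feasible point attains a smaller objective than lambda_min
xs = rng.standard_normal((10000, n))
xs /= np.linalg.norm(xs, axis=1, keepdims=True)
print(np.all(np.einsum('ij,jk,ik->i', xs, S, xs) >= evals[0] - 1e-9))  # True
```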

Simple case 2:

For $M=0$ and $a \neq 0$, we obtain

$$2a \|x\|_1 \text{sgn}(x)=-2\mu x.$$

Hence, $x$ can be any of the vectors

$$\pm e_1,\dots , \pm e_n$$

with $\mu=-a$ (substituting $x=\pm e_i$ gives $2a\,\text{sgn}(x)=-2\mu x$, i.e., $\mu=-a$). More generally, any unit vector with $k$ equal-magnitude nonzero entries satisfies (5) with $\mu=-ak$ and has objective value $ak$. Hence, for $a>0$, the $2n$ KKT points $\pm e_1,\dots,\pm e_n$ are the optimal solutions.
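A short check of this case (the values $n=3$ and $a=0.8$ are illustrative; the multiplier is computed from $2a\|x\|_1\,\text{sgn}(x)=-2\mu x$ at $x=\pm e_i$, giving $\mu=-a$): each $\pm e_i$ satisfies (5), and no sampled feasible point does better than the value $a$.

```python
# Simple case 2 (M = 0, a > 0): each +-e_i is a KKT point of (5).
import numpy as np

n, a = 3, 0.8          # illustrative sizes
mu = -a                # from 2a*sgn(x) = -2*mu*x at x = +-e_i

# every +-e_i satisfies system (5)
for i in range(n):
    for sign in (1.0, -1.0):
        x = np.zeros(n)
        x[i] = sign
        s = np.sign(x)
        residual = (2 * a * np.abs(x).sum() * s + 2 * mu * x) * s
        assert np.allclose(residual, 0)

# the objective a*||x||_1^2 is at least a on the unit sphere
rng = np.random.default_rng(3)
xs = rng.standard_normal((5000, n))
xs /= np.linalg.norm(xs, axis=1, keepdims=True)
print(np.all(a * np.abs(xs).sum(axis=1) ** 2 >= a - 1e-12))  # True
```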

Simple case 3:

For $M=0, a=0$, $x$ can be any vector with $\|x\|_2=1$.

Main case:

When $M \neq 0, a\neq 0$, we can rewrite equation (5) as follows:

$$\color{blue}{\left [ \left(\frac{M^T+M}{2} + a \, \text{sgn}(x)\text{sgn}(x)^T \right)x+\mu x \right ] \odot \text{sgn}(x) =0.} \tag{7} $$

For any fixed $\text{sgn}(x) \in \{-1,1\}^n$, the solutions of the above system are among the eigenpairs of the matrix $$\frac{M^T+M}{2} + a \, \text{sgn}(x)\text{sgn}(x)^T$$ (here $\|x\|_1\,\text{sgn}(x)=\text{sgn}(x)\text{sgn}(x)^Tx$, and dividing (5) by two gives the coefficient $a$). After finding a normalized eigenvector of this matrix, if its sign pattern agrees with the fixed $\text{sgn}(x)$, then it is a solution of (7). If some of the $x_i$ are zero, the same situation arises for a smaller system after reducing its size.

It seems that direct search (checking all the possibilities of $\text{sgn}(x) \in \{-1,0,1\}^n$) is the only way to find all solutions of (7), which is somewhat expected since the objective is not differentiable at any point $x$ with $x_i=0$ for some $i$. However, I conjecture that the number of sign patterns for which the system has a solution is of order $O(n)$, even though there are exponentially many patterns. If this conjecture is correct, there may be a way to efficiently find the feasible solutions. I asked a question on this conjecture here and hope I get some feedback.
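The direct search over sign patterns can be sketched for a small instance as follows (my own illustrative setup: for each pattern $s\in\{-1,0,1\}^n$, candidates come from the eigenvectors of $\frac{M^T+M}{2}+a\,ss^T$ restricted to the support of $s$, keeping only those whose signs agree with $s$; the coefficient $a$ follows from dividing (5) by two):

```python
# Direct search over sign patterns in {-1, 0, 1}^n for candidate
# solutions of (7), compared against dense sampling of the sphere.
import numpy as np
from itertools import product

rng = np.random.default_rng(4)
n = 3
M = rng.standard_normal((n, n))   # illustrative instance
a = 0.6
S = (M + M.T) / 2

def obj(x):
    return x @ S @ x + a * np.abs(x).sum() ** 2

candidates = []
for s in product([-1.0, 0.0, 1.0], repeat=n):
    s = np.array(s)
    T = np.flatnonzero(s)         # support of the sign pattern
    if T.size == 0:
        continue
    A = S[np.ix_(T, T)] + a * np.outer(s[T], s[T])
    for v in np.linalg.eigh(A)[1].T:
        for vt in (v, -v):        # eigenvectors are sign-ambiguous
            w = np.zeros(n)
            w[T] = vt
            if np.all(np.sign(w) == s):
                candidates.append(w)

best = min(obj(w) for w in candidates)

# dense sampling of the unit sphere gives an upper estimate of p*
xs = rng.standard_normal((100000, n))
xs /= np.linalg.norm(xs, axis=1, keepdims=True)
p_hat = (np.einsum('ij,jk,ik->i', xs, S, xs)
         + a * np.abs(xs).sum(axis=1) ** 2).min()
print(best <= p_hat + 1e-6)
```

If the candidate set really contains the global minimizer (as argued above), `best` should not exceed the sampled estimate `p_hat`.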

  • Thank you very much for your time and help. I think in your dual problem, we should have $\mu \ge 0$. – Sam Mar 09 '24 at 17:52
  • @Sam you are welcome! I just fixed the typo. – Amir Mar 09 '24 at 21:41
  • The term in your equivalent objective function $\sum_i z_i^2$ should be $\left( \sum_i z_i \right)^2$. – V.S.e.H. Mar 12 '24 at 01:09
  • Thank you @V.S.e.H. I just fixed the typo . You may see that the capsulated equation was derived from the correct equivalent objective. – Amir Mar 12 '24 at 02:50
  • Your dual function seems incorrect. For example, if $M=I, a = 1$, then $g(\mu) = \inf_{x\in\mathbb{R}^d} \{ \|x\|_2^2 + \|x\|_1^2 + \mu\|x\|_2 - \mu\} \geq \inf_{x\in\mathbb{R}^d} \{ 2\|x\|_2^2 +\mu \|x\|_2 - \mu \} > 0$ for all $\mu \in (-8,0)$ (check discriminant of quadratic). – V.S.e.H. Mar 12 '24 at 16:51
  • @V.S.e.H. Thanks for your help. Please check the updated answer. I consider $\|x\|^2_2=1$ instead of $\|x\|_2=1$ to derive the dual and KKT conditions. – Amir Mar 12 '24 at 18:50
  • Isn't the dual problem simply: $\quad \max \ \lambda_{\min}(\frac{M+M^T}{2}+([a]^-\sqrt{n}+[a]^+)I)$? – Sam Mar 14 '24 at 06:34
  • @Sam you may note that the dual problem is a maximization problem, i.e., $$\max_{\mu\in \mathbb R: \, M+\left ([a]^- \sqrt {n}+[a]^++\mu \right )I \succeq 0 } \, -\mu,$$ in which the decision variable is the only Lagrange multiplier $\mu$ (as the primal problem has only one constraint). The quantity $d^*=[a]^- \sqrt {n}+[a]^++\lambda_{\min}\left(\frac{M^T+M}{2} \right)$ is the optimal value of the dual problem (attained at $\mu=-d^*$), which is independent of both the primal variables ($x$) and the dual variable ($\mu$) (it is constant and cannot be optimized further). – Amir Mar 14 '24 at 06:58
  • I'm a bit confused with your KKT conditions. Have you considered the KKT multipliers for the inequalities (which should be nonnegative) in (3)? It seems to me that you just calculated the gradient of the Lagrangian wrt $x$. I'm also missing the complementary slackness condition (unless you somehow incorporated it in (3)). – V.S.e.H. Mar 15 '24 at 14:32
  • @V.S.e.H. Thanks for your feedback! As the objective is not differentiable, we first need to write the conditions for the inequality constraints, and then combine the difference of the two non-negative multipliers for each $i$ into a free multiplier. As at optimality we have $|x_i|=x_i$, after combining, the complementary slackness condition is implicitly accounted for by the new multiplier. – Amir Mar 15 '24 at 15:12
  • The part $|x_i| = x_i$ makes no sense, unless $x_i \geq 0$, which need not be true? Did you mean $z_i = \operatorname{sgn}(x_i)x_i$? And how do you "combine the difference" to get a free multiplier? Explain with clear steps. Also, the KKT conditions need not hold, even if the problem is locally optimal at some points. You need a qualification condition, such as LICQ, Slater (for convex), etc. – V.S.e.H. Mar 15 '24 at 15:29
  • @V.S.e.H. It is a typo, sorry, as I am out and using my phone to respond. I meant $|x_i|=z_i$. The special solution obtained by removing the gradient of the objective ($x=0$) cannot be feasible (see the Fritz John conditions), so we search for the optimal solution among the KKT points (no need to check a constraint qualification here). Once I get home, I will provide more details. – Amir Mar 15 '24 at 16:01
  • There may be no KKT points to search among. This is why you need a constraint qualification if you are to use the KKT conditions, and you have not provided that in your answer. Fritz John are optimality conditions, not constraint qualification conditions. – V.S.e.H. Mar 15 '24 at 16:13
  • @V.S.e.H. I know what Fritz John is, and how it differs from a constraint qualification. Instead of the KKT conditions, we can use Fritz John, for which no constraint qualification is needed. These conditions generate points by removing the gradient of the objective, in addition to the KKT points. The additional point here cannot be feasible, so we focus on the KKT points only. Note that since the feasible set is compact and the objective is continuous, we have at least one KKT point, as we have an optimal solution. – Amir Mar 15 '24 at 16:21
  • Dear @Amir, are you willing to discuss this problem further, and why I came up with this question? I would be more than happy to collaborate with you on this research question. If you are interested, please let me know how to contact you. Thanks. – Sam Mar 20 '24 at 23:40
  • I am hoping that you have seen my previous comment @Amir regarding collaboration. By the way, I am a scholar at American University. – Sam Mar 21 '24 at 00:20
  • Thank you @Sam for the invitation! You may contact me at my email address. – Amir Mar 21 '24 at 02:09
  • @Amir Thank you for the email address. I just emailed you! Appreciate it! – Sam Mar 21 '24 at 06:00