
I am trying to solve (approximately) the following maximization problem:

$$ \max_{U_i, V_j} \sum_{i,j} \frac{\operatorname{trace}^2(A_{ij}^\top U_i V_j)}{\|U_i V_j\|_F^2}, $$

where:

  • $ i, j \in \{1,2\} $,
  • $ \|\cdot\|_F $ denotes the Frobenius norm,
  • $ A_{ij} $ is an $ n \times n $ matrix,
  • $ U_i $ is an $ n \times r $ matrix,
  • $ V_j $ is an $ r \times n $ matrix,
  • and $ r \le n $.
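
For concreteness, here is a small NumPy sketch that evaluates this objective for random inputs (the function name `objective` and the dimensions are purely illustrative):

```python
import numpy as np

def objective(A, U, V):
    """Sum over i, j of trace(A[i][j].T @ U[i] @ V[j])**2 / ||U[i] @ V[j]||_F**2."""
    total = 0.0
    for i in range(2):
        for j in range(2):
            B_ij = U[i] @ V[j]                       # n x n product U_i V_j
            num = np.trace(A[i][j].T @ B_ij) ** 2    # trace^2(A_ij^T U_i V_j)
            den = np.linalg.norm(B_ij, 'fro') ** 2   # ||U_i V_j||_F^2
            total += num / den
    return total

n, r = 5, 3
rng = np.random.default_rng(0)
A = [[rng.standard_normal((n, n)) for _ in range(2)] for _ in range(2)]  # A_ij : n x n
U = [rng.standard_normal((n, r)) for _ in range(2)]                      # U_i  : n x r
V = [rng.standard_normal((r, n)) for _ in range(2)]                      # V_j  : r x n
print(objective(A, U, V))
```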

My Attempt:

I tried to approach this problem by considering the SVD of a larger $ (2n) \times (2n) $ matrix $ A $ formed as follows:

$$ A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}. $$

Suppose the SVD of $ A $ is $ P D Q $ (I used $Q$ rather than $Q^\top$ intentionally), where:

  • $ P = \begin{bmatrix} P_1 & P_1' \\ P_2 & P_2' \end{bmatrix} $,
  • $ D = \operatorname{blockdiag}(\Sigma, \Sigma') $,
  • $ Q = \begin{bmatrix} Q_1 & Q_2 \\ Q_1' & Q_2' \end{bmatrix} $ (so that $ A_{ij} = P_i \Sigma Q_j + P_i' \Sigma' Q_j' $),
  • $ \Sigma $ and $ \Sigma' $ are diagonal matrices, with $ \Sigma $ being $ r \times r $,
  • $ P_1 $ and $ Q_1 $ are $ n \times r $ and $ r \times n $, respectively.

Based on this decomposition, I set $ U_i = P_i $ and $ V_j = \Sigma Q_j $. Substituting this into the objective function, I obtained:

$$ \sum_{i,j}\frac{\left( \|P_i \Sigma Q_j\|_F^2 + \operatorname{trace}\left((P'_i \Sigma' Q'_j)^\top P_i \Sigma Q_j\right) \right)^2}{\|P_i \Sigma Q_j\|_F^2}. $$
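To make the partition concrete, here is a sketch continuing the snippet above, assuming the top blocks $Q_1, Q_2$ have $r$ rows (NumPy's `svd` returns the right factor $Q$ directly, matching the convention $A = PDQ$):

```python
# Build U_i = P_i and V_j = Sigma Q_j from the SVD of the stacked (2n) x (2n) matrix.
Abig = np.block(A)                                  # (2n) x (2n)
P, s, Q = np.linalg.svd(Abig)                       # Abig = P @ diag(s) @ Q
Sigma = np.diag(s[:r])                              # leading r x r block of D
P_blk = [P[:n, :r], P[n:, :r]]                      # P_1, P_2 : n x r (left r columns)
Q_blk = [Q[:r, :n], Q[:r, n:]]                      # Q_1, Q_2 : r x n (top r rows)
U_svd = [P_blk[i] for i in range(2)]                # U_i = P_i
V_svd = [Sigma @ Q_blk[j] for j in range(2)]        # V_j = Sigma Q_j
print(objective(A, U_svd, V_svd))
```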

However, I am unsure how to proceed from here. Specifically:

  1. Is this approach viable, or am I going in the wrong direction?
  2. If it is viable, how can I simplify the expression and compute the maximization?
  3. If this method is not appropriate, are there alternative techniques to tackle this maximization problem?

I would greatly appreciate any guidance or insights into solving this problem.

Alex

1 Answer


$ \def\t{\times} \def\R#1{{\mathbb R}^{#1}} \def\a{\alpha} \def\b{\beta} \def\s{\sigma} \def\l{\lambda} \def\lR#1{\Big(#1\Big)} \def\bR#1{\Big[#1\Big]} \def\LR#1{\left(#1\right)} \def\BR#1{\left[#1\right]} \def\CR#1{\left\lbrace #1 \right\rbrace} \def\op#1{\operatorname{#1}} \def\vc#1{\op{vec}\LR{#1}} \def\trace#1{\op{Tr}\LR{#1}} \def\cross#1{\,\left[\,#1\,\right]_\times} \def\frob#1{\left\| #1 \right\|_F} \def\q{\quad} \def\qq{\qquad} \def\qif{\q\iff\q} \def\qiq{\q\implies\q} \def\p{\partial} \def\grad#1#2{\frac{\p #1}{\p #2}} \def\m#1{\left[\begin{array}{c}#1\end{array}\right]} \def\mb#1{\left[\begin{array}{c|c}#1\end{array}\right]} \def\rr#1{\color{red}{#1}} \def\gg#1{\color{green}{#1}} \def\bb#1{\color{blue}{#1}} \def\RLR#1{\rr{\LR{#1}}} \def\BLR#1{\bb{\LR{#1}}} \def\GLR#1{\gg{\LR{#1}}} \def\gradLR#1#2{\LR{\grad{#1}{#2}}} \def\ai{\a^{-1}} \def\aa{\a^2} \def\bi{\b^{-1}} \def\bbi{\b^{-2}} \def\T{{\sf T}} $Gather all of the indexed matrices into block matrices $$\small\eqalign{ U = \m{U_1 \\ U_2 \\ \vdots \\ U_p}, \q V = \m{V_1\;V_2 \ldots\;V_p}, \q A = \m{ A_{11} & A_{12} & \ldots & A_{1p} \\ A_{21} & A_{22} & \ldots & A_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ A_{p1} & A_{p2} & \ldots & A_{pp} \\ } \in\R{pn\t pn} \\ }$$ Then define these auxiliary variables $$\eqalign{ B &= UV &\qiq dB = dU\:V + U\:dV \\ \a &= A:B &\qiq \gg{d\a = A:dB} \\ \b &= B:B &\qiq \rr{d\b = 2B:dB} \\ }$$ where a colon denotes the Frobenius product which has the following properties $$\eqalign{ A:B &= \sum_{i=1}^m\sum_{j=1}^n A_{ij}B_{ij} \;=\; \trace{A^TB} \\ B:B &= \frob{B}^2 \qquad \{ {\sf Frobenius\;norm} \}\\ A:B &= B:A \;=\; B^T:A^T \\ \LR{PQ}:B &= P:\LR{BQ^T} \;=\; Q:\LR{P^TB} \\ \\ }$$
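As a quick illustration of this block notation and the Frobenius product, here is a self-contained NumPy sketch (the sizes and variable names are arbitrary):

```python
import numpy as np

n, r, p = 5, 3, 2
rng = np.random.default_rng(1)

U = np.vstack([rng.standard_normal((n, r)) for _ in range(p)])   # (pn) x r  stack of the U_i
V = np.hstack([rng.standard_normal((r, n)) for _ in range(p)])   # r x (pn)  stack of the V_j
A = rng.standard_normal((p * n, p * n))                          # block matrix of the A_ij

frob = lambda X, Y: np.sum(X * Y)        # Frobenius product  X : Y = Tr(X^T Y)

B = U @ V                                # B = UV, whose (i,j) block is U_i V_j
alpha = frob(A, B)                       # alpha = A : B
beta = frob(B, B)                        # beta  = B : B = ||B||_F^2

assert np.isclose(alpha, np.trace(A.T @ B))
assert np.isclose(beta, np.linalg.norm(B, 'fro') ** 2)
```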


Rewrite the objective function using the above notation and calculate its differential $$\eqalign{ \def\smB{{\small B}} \l &= \bi\aa \\ d\l &= \bi(2\a\,\gg{d\a}) \;-\; \aa\bbi\,\rr{d\b} \\ &= \bi(2\a\gg{A:dB}) \;-\; \aa\bbi\,\RLR{2B:dB} \\ &= \bb{2\bbi\a{\lR{\b A-\a B}}}:{dB} \\ &\bb{\equiv G}:{dB} \\ &= G:\LR{dU\:V + U\:dV} \\ &= GV^T:dU \;+\; U^TG:dV \\ }$$ from which the gradients are readily identified $$\eqalign{ \grad{\l}{U} &= G V^T \qq \grad{\l}{V} &= U^T G \qq\qq\qq }$$ With the gradients in hand, we can apply gradient ascent (since this is a $\max$ problem) and alternate between solving for $U$, then for $V$, then for $U,\,\ldots$ until convergence is achieved $$\eqalign{ U_{k+1} &= U_k + \tau_k \gradLR{\l_k}{U_k} &\qiq U_{k+1}=\frac{U_{k+1}}{\|U_{k+1}\|} \\ V_{j+1} &= V_j + \s_j \gradLR{\l_j}{V_j} &\qiq V_{j+1}=\frac{V_{j+1}}{\|V_{j+1}\|} \\ }$$ where $\CR{\s_j,\tau_k}$ denote the steplengths in the respective ascent algorithms, and the subscripts denote the iteration count, rather than the partition of the block matrix.

Since the function value depends only on the matrix directions and not their magnitudes $$\eqalign{ &\l(U,V) \;=\; \l\!\LR{\frac{U}{\|U\|},\,\frac{V}{\|V\|}} \\ }$$ normalizing $U$ and $V$ on every step will prevent runaway growth of the matrices.
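A minimal sketch of this normalized gradient-ascent loop, continuing the variables from the previous snippet (the fixed steplength and iteration count are placeholder choices, not tuned values):

```python
def grad_ascent(A, U, V, tau=1e-2, iters=500):
    """Maximize lambda = (A:UV)^2 / ||UV||_F^2 by alternating normalized gradient steps."""
    for _ in range(iters):
        B = U @ V
        a, b = np.sum(A * B), np.sum(B * B)
        G = (2 * a / b**2) * (b * A - a * B)     # G = 2 beta^-2 alpha (beta A - alpha B)
        U = U + tau * (G @ V.T)                  # step along dlambda/dU = G V^T
        U = U / np.linalg.norm(U)                # normalize to prevent runaway growth

        B = U @ V
        a, b = np.sum(A * B), np.sum(B * B)
        G = (2 * a / b**2) * (b * A - a * B)
        V = V + tau * (U.T @ G)                  # step along dlambda/dV = U^T G
        V = V / np.linalg.norm(V)
    return U, V

U1, V1 = grad_ascent(A, U, V)
B1 = U1 @ V1
print(np.sum(A * B1) ** 2 / np.sum(B1 * B1))     # lambda at the final iterate
```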


Another idea is to set the gradients to zero, and solve for each matrix variable in terms of the pseudoinverse of the other, i.e. $$\eqalign{ \def\g{\gamma\,} U &= \g AV^{\bf+} &\qiq U=\frac{U}{\|U\|} \\ V &= \g U^{\bf+}\!A &\qiq V=\frac{V}{\|V\|} \\ }$$ where $\g=\large\LR{\frac{UV\,:\,UV}{\:A\::\,UV}}\,$ is the scale required by the zero-gradient condition, but since we normalize immediately afterward, setting $\g={\tt1}$ also works and avoids unnecessary computations.

Once again, you hold $V$ constant and solve for $U$, then hold $U$ constant and solve for $V$, then solve for $U$, then for $V\:\ldots$ until convergence.

This is Alternating Least Squares, which typically converges faster than gradient ascent/descent.
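Here is a sketch of that alternating scheme with $\gamma={\tt1}$, reusing the variables from the sketches above (NumPy's `pinv` computes the pseudoinverse; the iteration count is a placeholder rather than a convergence test):

```python
def als(A, U, V, iters=100):
    """Alternate  U = A V^+  and  V = U^+ A  (gamma = 1), normalizing after each solve."""
    for _ in range(iters):
        U = A @ np.linalg.pinv(V)                # hold V fixed, solve for U
        U = U / np.linalg.norm(U)
        V = np.linalg.pinv(U) @ A                # hold U fixed, solve for V
        V = V / np.linalg.norm(V)
    return U, V

U2, V2 = als(A, U, V)
B2 = U2 @ V2
print(np.sum(A * B2) ** 2 / np.sum(B2 * B2))     # lambda at the ALS iterate
```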


After testing a few small systems, I found that this problem behaves like Nonnegative Matrix Factorization, i.e. it probably has a global optimum (which is $\sf NP$-hard to find) and lots of practical local optima which are very easy to compute.

greg