5

I am reading this paper where they use Matrix Factorization over Attention mechanism in their Hamburger model. In section 2.2.2 they say,

Vector Quantization (VQ) (Gray & Neuhoff, 1998), a classic data compression algorithm, can be formulated as an optimization problem in terms of matrix decomposition: $$ \min _{\boldsymbol{D}, \boldsymbol{C}}\|\boldsymbol{X}-\boldsymbol{D} \boldsymbol{C}\|_F \quad \text { s.t. } \mathbf{c}_i \in\left\{\mathbf{e}_1, \mathbf{e}_2, \cdots, \mathbf{e}_r\right\}\tag1 $$ where $\mathbf{e}_i$ is the canonical basis vector, $\mathbf{e}_i=[0, \cdots, 1, \cdots, 0]^{\top}$. The algorithm that minimizes the objective in Eq. (1) is K-means (Gray & Neuhoff, 1998). However, to ensure that VQ is differentiable, we replace the hard $\arg\min$ and Euclidean distance with softmax and cosine similarity, leading to Alg. 1, where $\operatorname{cosine}(\boldsymbol{D}, \boldsymbol{X})$ is a similarity matrix whose entries satisfy $\operatorname{cosine}(\boldsymbol{D}, \boldsymbol{X})_{i j}=\frac{\mathbf{d}_i^{\top} \mathbf{x}_j}{\|\mathbf{d}_i\|\|\mathbf{x}_j\|}$, softmax is applied column-wise, and $T$ is the temperature. Further, we can obtain a hard assignment by a one-hot vector when $T \rightarrow 0$.
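To make the question concrete, here is my numpy sketch of one iteration of Alg. 1 as I read it (the shapes, function name, and default temperature are my own assumptions: $X$ is $d\times n$, $D$ is $d\times r$, $C$ is $r\times n$):

```python
import numpy as np

def soft_vq_step(X, D, T=0.1):
    """One iteration of the soft VQ in Alg. 1 (my reading of it).

    X : (d, n) data matrix; D : (d, r) dictionary of r "cluster centers".
    Returns the soft assignment C (r, n) and the updated D.
    """
    # cosine(D, X)_{ij} = d_i^T x_j / (||d_i|| ||x_j||)
    Dn = D / np.linalg.norm(D, axis=0, keepdims=True)
    Xn = X / np.linalg.norm(X, axis=0, keepdims=True)
    S = Dn.T @ Xn                                   # (r, n) similarity matrix
    # column-wise softmax with temperature T (T -> 0 gives one-hot columns)
    E = np.exp((S - S.max(axis=0, keepdims=True)) / T)
    C = E / E.sum(axis=0, keepdims=True)
    # D <- X C^T diag(C 1_n)^{-1}: soft-weighted means of the points per center
    D_new = (X @ C.T) / C.sum(axis=1)
    return C, D_new
```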

I didn't understand the last line, "...and softmax is applied column-wise and $T$ is the temperature. Further, we can obtain a hard assignment by a one-hot vector when $T \rightarrow 0$." What do they mean by $T$ (temperature) here?


In fact, I couldn't follow the justification for replacing $\arg \min$ with softmax here. And what is their update rule $D\leftarrow XC^{T}\text{diag}(C1_n)^{-1}$ doing?

And it seems like their VQ is similar to traditional non-negative matrix factorization with a regularization on $C$. Or am I confusing the two?

Thanks in advance.

Update

I removed some questions that might not comply with M.SE rules and condensed the thread into a single math-based question. Hopefully it will get better reach now.

  • 2
    I hope this question fits with M.SE, as it is mostly about matrix factorization and their update rule. Please let me know if there is another SE forum that fits better. Thanks – WhyMeasureTheory Mar 14 '23 at 20:03
  • If I need to change or add more context, please let me know. I have been struggling to understand this problem for two days. – WhyMeasureTheory Mar 17 '23 at 10:54

3 Answers

2

If $\mathbf{C}$ is known, then the second step minimizes $\phi(\mathbf{D}) = \| \mathbf{X-DC} \|_F^2$.

The gradient wrt $\mathbf{D}$ is $$ \frac{\partial \phi}{\partial \mathbf{D}} = 2(\mathbf{DC-X})\mathbf{C}^T $$ Setting the gradient to zero yields $$ \mathbf{D}(\mathbf{C}\mathbf{C}^T) = \mathbf{X}\mathbf{C}^T $$

In the hard assignment case, each column of $\mathbf{C}$ is the one-hot encoding of the cluster. Thus $\mathbf{C}\mathbf{C}^T$ is a diagonal matrix with dimensions $N_c\times N_c$ ($N_c$ is the number of clusters) and the diagonal contains the number of points per cluster. In this case, $\mathbf{C}\mathbf{C}^T =\mathrm{Diag}(\mathbf{C}\mathbf{1}_N)$ for $N$ points.
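A quick numerical check of this identity (a toy example with 3 clusters and 6 points; the labels are arbitrary):

```python
import numpy as np

# Hard-assignment case: each column of C is one-hot (3 clusters, 6 points).
labels = np.array([0, 2, 1, 0, 0, 2])
C = np.eye(3)[labels].T            # shape (3, 6), one-hot columns

lhs = C @ C.T                      # (3, 3)
rhs = np.diag(C @ np.ones(6))      # Diag(C 1_N): points per cluster on the diagonal
assert np.allclose(lhs, rhs)       # both equal diag(3, 1, 2) here
```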

In the soft assignment case this is no longer true, but the expression may still be a good approximation. The updated centroids are then found (approximately) by $$ \mathbf{D} = \mathbf{X}\mathbf{C}^T \mathrm{Diag}(\mathbf{C}\mathbf{1}_N)^{-1} $$

Regarding the 'temperature': the softmax function is not scale invariant. The more we scale up the inputs, the more the largest input dominates the output. With increasing scale, the softmax function assigns a value close to 1 to the largest input and values close to 0 to all the others. This is caused by the nature of the exponential function, which grows faster the larger its input is. Dividing the similarities by $T$ is exactly such a scaling, so as $T \to 0$ the soft assignments approach one-hot vectors.
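A small illustration of this effect (the similarity values here are made up; dividing by $T$ is the same as scaling the inputs by $1/T$):

```python
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=float) / T
    e = np.exp(z - z.max())        # subtract max for numerical stability
    return e / e.sum()

s = np.array([0.9, 0.5, 0.1])      # e.g. cosine similarities to 3 centers
for T in (1.0, 0.1, 0.01):
    print(T, softmax(s, T).round(3))
# As T -> 0 the output approaches the one-hot vector [1, 0, 0],
# i.e. the hard arg-max assignment.
```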

Steph
  • From $\phi(\mathbf{D})= \| \mathbf{X-DC} \|_F^2=Tr((\mathbf{X-DC})^T(\mathbf{X-DC}))=Tr((\mathbf{X^T-C^TD^T})(\mathbf{X-DC}))$, I get the gradient $\frac{\partial \phi}{\partial \mathbf{D}}=-\mathbf{C}^T(\mathbf{X-DC})$ (can I exchange the position?), and can you say something about what one-hot encoding means here? And thanks for your response, @Steph – WhyMeasureTheory Mar 19 '23 at 16:31
  • There is only one non-zero element (equal to 1) in each column of $\mathbf{C}$. It indicates which cluster the point belongs to. This terminology is used in machine learning; see https://en.wikipedia.org/wiki/One-hot for details. And no, you cannot exchange the positions in your gradient. You have to use 'mine'. – Steph Mar 19 '23 at 17:08
1

The authors of the paper are attempting to make the Vector Quantization (VQ) algorithm differentiable, which is necessary for gradient-based optimization in neural networks. VQ, in its original form, involves a hard assignment of data points to their nearest cluster centers, which is not differentiable due to the arg min operation and the Euclidean distance metric. To overcome this issue, the authors propose replacing the hard arg min and Euclidean distance with the softmax function and the cosine similarity metric, respectively.

The softmax function is used because it provides a continuous, differentiable approximation of the arg min operation. Instead of hard assignments, softmax generates a probability distribution over the cluster centers, which can be seen as "soft assignments". The temperature parameter $T$ in the softmax function controls the sharpness of this distribution. As $T$ approaches 0, the distribution becomes more like a one-hot vector, mimicking the hard assignment of the original VQ.

The update rule $D\leftarrow XC^{T}\text{diag}(C1_n)^{-1}$ is derived from the optimization problem stated in the paper. The authors are trying to minimize the objective function $\|\boldsymbol{X}-\boldsymbol{D} \boldsymbol{C}\|_F$, with the hard assignment replaced by the soft assignment matrix $C$. This rule is used to update the dictionary matrix $D$ iteratively.

High-level explanation of the update rule:

  1. $XC^T$ computes the weighted sum of data points for each cluster. The matrix $C$ contains the soft assignments, so this multiplication effectively computes the sum of data points assigned to each cluster, weighted by their assignment probabilities.
  2. $C1_n$ computes the sum of soft assignment probabilities for each cluster.
  3. $\text{diag}(C1_n)^{-1}$ creates a diagonal matrix with the inverse of the cluster sums from step 2 along the diagonal. This step computes the reciprocal of the sums, which will be used for normalization.
  4. Multiplying the result from step 1 with the matrix from step 3 normalizes the weighted sums from step 1 by dividing them by the total assignment probabilities for each cluster. This operation computes the updated cluster centers (dictionary atoms) by taking the weighted average of the data points assigned to each cluster.

This update rule ensures a differentiable optimization process while still closely resembling the original VQ algorithm.
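The four steps above can be sketched in numpy (a toy example with made-up soft assignments; the dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 10))       # 10 data points in R^4
C = rng.random(size=(3, 10))
C /= C.sum(axis=0)                 # soft assignments: columns sum to 1

weighted_sums = X @ C.T            # step 1: (4, 3), per-cluster weighted sums
cluster_mass = C @ np.ones(10)     # step 2: C 1_n, total soft mass per cluster
norm = np.diag(1.0 / cluster_mass) # step 3: diag(C 1_n)^{-1}
D = weighted_sums @ norm           # step 4: weighted averages = new centers

# Sanity check: column k of D is the C-weighted mean of the data points.
assert np.allclose(D[:, 0], (X * C[0]).sum(axis=1) / C[0].sum())
```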

rumathe
1

I'm glad my previous answer was helpful. If you're looking for more sophisticated resources to learn about optimization and develop the ability to dissect optimization algorithms, I recommend the following books and online resources:

  1. Convex Optimization by Stephen Boyd and Lieven Vandenberghe: This is a widely-used textbook in optimization and covers a broad range of topics in convex optimization, including duality, optimality conditions, and various optimization algorithms. It provides a solid foundation in optimization theory and techniques.
    Book: https://web.stanford.edu/~boyd/cvxbook/
    Online Course: https://www.youtube.com/playlist?list=PL3940DD956CDF0622

  2. Numerical Optimization by Jorge Nocedal and Stephen J. Wright: This book covers various optimization algorithms, including both unconstrained and constrained optimization methods. It's a comprehensive resource for learning optimization techniques and understanding the underlying theory.
    Book: http://users.iems.northwestern.edu/~nocedal/book/num-opt.html

  3. Optimization for Machine Learning edited by Suvrit Sra, Sebastian Nowozin, and Stephen J. Wright: This book focuses on optimization methods specifically used in machine learning. It covers gradient-based optimization, subgradient methods, stochastic optimization, and more.
    Book: https://mitpress.mit.edu/books/optimization-machine-learning

  4. Lecture Notes on Optimization by Pravin Varaiya: These lecture notes provide a concise introduction to optimization techniques, covering both linear and nonlinear programming, as well as convex optimization.
    Lecture Notes: https://people.eecs.berkeley.edu/~varaiya/Download/Varaiya-Optimization.pdf

  5. EE227C: Convex Optimization and Approximation by Moritz Hardt and Benjamin Recht: These lecture notes and additional resources focus on convex optimization and approximation algorithms for machine learning and data science applications.
    Course Notes: https://ee227c.github.io/

By studying these resources, you'll gain a deeper understanding of optimization techniques and be able to dissect optimization steps in various algorithms. Additionally, engaging with online communities such as the Machine Learning subreddit (https://www.reddit.com/r/MachineLearning/) or the AI section of arXiv (https://arxiv.org/list/cs.AI/recent) can help you stay up-to-date with the latest research and discussions on optimization and machine learning.

rumathe