32

I have seen literature suggesting that L2 loss and mean squared error loss are two different kinds of loss functions.

However, it seems to me that these two loss functions essentially compute the same thing (up to a 1/n factor).

So I am wondering whether I have missed anything. Is there any scenario in which one should prefer one of these two loss functions over the other?

Edamame

7 Answers

28

The function $L_2(x):=\left \|x \right \|_2$ is a norm; it is not a loss by itself. It is called a "loss" when it is used in a loss function to measure a distance between two vectors, $\left \| y_1 - y_2 \right \|^2_2$, or to measure the size of a vector, $\left \| \theta \right \|^2_2$. This goes together with loss minimization, which tries to bring these quantities to their "least" possible values.

These are some illustrations:

  1. $L_p$ norm: $L_p(x) := \left \|x \right \|_p = (\sum_{i=1}^{D} |x_i|^p)^{1/p}$,
    where $D$ is the dimension of vector $x$,

  2. Squared error: $\mbox{SE}(A, \theta) =\sum_{n=1}^{N} \left \| y_n - f_{\theta}(x_n) \right \|^2_2$,
    where $A=\{(x_n, y_n)\}_{n=1}^{N}$ is a set of data points, and $f_{\theta}(x_n)$ is the model's estimate of $y_n$,

  3. Mean squared error: $\mbox{MSE}(A, \theta) =\mbox{SE}(A, \theta)/N$,

  4. Least squares optimization: $\theta^*=\mbox{argmin}_{\theta} \mbox{MSE}(A, \theta) = \mbox{argmin}_{\theta} \mbox{SE}(A, \theta)$,

  5. Ridge loss: $\mbox{R}(A, \theta, \lambda) = \mbox{MSE}(A, \theta) + \lambda\left \| \theta \right \|^2_2$

  6. Ridge optimization (regression): $\theta^*=\mbox{argmin}_{\theta} \mbox{R}(A, \theta, \lambda)$.

In all of the above examples, the $L_2$ norm can be replaced with the $L_1$ norm, the $L_\infty$ norm, etc. However, the names "squared error", "least squares", and "Ridge" are reserved for the $L_2$ norm. For example, for $L_1$, "squared error" becomes "absolute error" (a small numerical sketch of these quantities is given below):

  1. Absolute error: $\mbox{AE}(A, \theta) =\sum_{n=1}^{N} \left \| y_n - f_{\theta}(x_n) \right \|_1$.
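
As a quick illustration, here is a minimal NumPy sketch of these quantities, assuming a toy linear model $f_{\theta}(x) = \theta^\top x$ and $\lambda = 0.1$ (both choices are arbitrary, for illustration only):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))          # N = 5 data points x_n, each of dimension D = 3
theta = np.array([1.0, -2.0, 0.5])   # hypothetical parameter vector
y = X @ theta + rng.normal(scale=0.1, size=5)   # targets y_n

def lp_norm(x, p):
    # L_p norm of a vector x
    return np.sum(np.abs(x) ** p) ** (1.0 / p)

residuals = y - X @ theta                        # y_n - f_theta(x_n)
se = np.sum(residuals ** 2)                      # squared error SE(A, theta)
mse = se / len(y)                                # mean squared error = SE / N
ridge = mse + 0.1 * lp_norm(theta, 2) ** 2       # ridge loss with lambda = 0.1
ae = np.sum(np.abs(residuals))                   # absolute error AE(A, theta)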
Esmailian
12

They are different:

L2 = $\sqrt{\sum_{i=1}^{N}(y_i-y_{i}^{pred})^2}$

MSE = $\frac{\sum_{i=1}^{N}(y_i-y_{i}^{pred})^2}{N}$

The L2 norm involves a sum and a square root, while the MSE involves a sum and a mean!

We can check this with the following code:

import numpy as np
from sklearn.metrics import mean_squared_error

y = np.array(range(10, 20))        # array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19])
y_pred = np.array(range(10))       # array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
np.linalg.norm(y_pred - y, ord=2)  # L2 norm: 31.622776601683793
mean_squared_error(y_pred, y)      # MSE: 100.0

Belter
10

To be precise, the L2 norm of the error vector is the root mean squared error, up to a constant factor (namely $\sqrt{N}$). Hence the squared L2-norm notation $\|e\|^2_2$, commonly found in loss functions.

However, $L_p$-norm losses should not be confused with regularizers. For instance, a combination of the L2 error with the L2 norm of the weights (both squared, of course) gives you the well-known ridge regression loss, while a combination of the L2 error with the L1 norm of the weights gives rise to lasso regression.
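
For concreteness, here is a minimal sketch of the two combinations; the data, weights, and regularization strength are arbitrary values chosen only for illustration:

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 4))       # illustrative design matrix
y = rng.normal(size=20)            # illustrative targets
w = rng.normal(size=4)             # illustrative weight vector
lam = 0.5                          # illustrative regularization strength

sq_l2_error = np.sum((y - X @ w) ** 2)               # ||y - Xw||_2^2
ridge_loss = sq_l2_error + lam * np.sum(w ** 2)      # + lambda * ||w||_2^2  (ridge)
lasso_loss = sq_l2_error + lam * np.sum(np.abs(w))   # + lambda * ||w||_1    (lasso)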

M0nZDeRR
2

Belter is right, but, as observed by Toonia, we can see that: $$L_2 = \sqrt{N \times MSE}= \sqrt{\sum_{i=1}^{N}(y_i-y_{i}^{pred})^2}$$
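
Reusing Belter's example above, this identity is easy to verify numerically:

import numpy as np
from sklearn.metrics import mean_squared_error

y = np.arange(10, 20)
y_pred = np.arange(10)
l2 = np.linalg.norm(y - y_pred, ord=2)       # 31.622776601683793
mse = mean_squared_error(y, y_pred)          # 100.0
np.isclose(l2, np.sqrt(len(y) * mse))        # True: L2 = sqrt(N * MSE)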

1

By the theory of Riemann integration (taking $[a,b]=[0,1]$ so that $\Delta x = 1/n$), \begin{align*} \int_0^1 |f(x)-g(x)|^2\,dx &= \lim_{n \to \infty} \sum_{k=1}^n |f(x_k)-g(x_k)|^2 \Delta x\\ &= \lim_{n \to \infty} \frac{1}{n} \sum_{k=1}^n |f(x_k) - g(x_k)|^2 \\ & \approx \frac{1}{n} \sum_{k=1}^n |f(x_k) - g(x_k)|^2 \end{align*} for $n$ sufficiently large. You can recognize the LHS as originating from the $L_2$ norm and the RHS as the MSE. If we work on function spaces and consider pointwise evaluation of functions, then the MSE essentially approximates the squared $L_2$ norm of the difference.

In finite dimensions, on the other hand, the MSE is the squared norm divided by the dimension, i.e., $$ \|y - \hat{y}\|_2^2 = \sum_{k=1}^n |y_k - \hat{y}_k|^2, \qquad \text{MSE} = \frac{1}{n} \|y - \hat{y}\|_2^2. $$ The difference, if there is one, is measure-theoretic.
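
As a rough numerical illustration of this approximation, with $[a,b]=[0,1]$ and $f$, $g$ chosen arbitrarily for the example:

import numpy as np
from scipy.integrate import quad

f, g = np.sin, np.cos                         # arbitrary example functions
integral, _ = quad(lambda x: (f(x) - g(x)) ** 2, 0.0, 1.0)

n = 10_000
x = np.linspace(0.0, 1.0, n, endpoint=False) + 0.5 / n   # midpoints of n cells
mse = np.mean((f(x) - g(x)) ** 2)             # (1/n) * sum |f(x_k) - g(x_k)|^2
# mse approaches the value of the integral as n grows.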

Toonia
0

An L2 optimization and an MSE optimization are equivalent: they have the same minimizer, since the two objectives differ only by a square root (a monotone transformation) and a positive constant factor of $1/N$.
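
This can be checked numerically for a simple linear model (the data below are arbitrary); minimizing the L2 norm of the residual, the squared error, and the MSE all return essentially the same parameters:

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))                       # illustrative inputs
theta_true = np.array([2.0, -1.0, 0.5])
y = X @ theta_true + rng.normal(scale=0.1, size=50)

l2_obj = lambda th: np.linalg.norm(y - X @ th)     # L2 norm of the residual
sse_obj = lambda th: np.sum((y - X @ th) ** 2)     # squared error
mse_obj = lambda th: np.mean((y - X @ th) ** 2)    # mean squared error

x0 = np.zeros(3)
minimizers = [minimize(obj, x0).x for obj in (l2_obj, sse_obj, mse_obj)]
# All three minimizers agree up to numerical tolerance.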

Lcat
-2

I think we use L2 norms for computational reasons. If we compute the MSE with an explicit for loop, it takes more computation, whereas the L2 norm can be computed with matrix operations, which saves computation in any programming language when we have huge data. Overall, I think both are doing the same thing. Please correct me if I am wrong!
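
For what it's worth, here is a small sketch comparing an explicit Python loop with vectorized NumPy (the array size and data are arbitrary); both the MSE and the L2 norm appear in vectorized form:

import time
import numpy as np

rng = np.random.default_rng(3)
y = rng.normal(size=1_000_000)
y_pred = rng.normal(size=1_000_000)

t0 = time.perf_counter()
mse_loop = sum((a - b) ** 2 for a, b in zip(y, y_pred)) / len(y)   # explicit loop
t1 = time.perf_counter()
diff = y - y_pred
mse_vec = diff @ diff / len(y)               # vectorized MSE
l2_vec = np.sqrt(diff @ diff)                # vectorized L2 norm
t2 = time.perf_counter()
# The vectorized computation is typically orders of magnitude faster than the loop.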