30

I've always used the method of Lagrange multipliers with blind confidence that it gives correct results when optimizing a function subject to constraints. But I would like to know if anyone can provide or recommend a derivation of the method, at physics undergraduate level, that highlights its limitations, if any.

  • @John, you may or may not find the answers to this similar question helpful: http://math.stackexchange.com/q/674/400 – Vladimir Sotirov Feb 27 '11 at 06:03
  • Relevant thread: https://math.stackexchange.com/questions/1760709/how-to-prove-lagrange-multiplier-theorem-in-a-rigorous-but-intuitive-way – littleO May 09 '25 at 07:10

3 Answers

47

Lagrange multipliers are used to obtain the maximum of a function $f(\mathbf{x})$ on a surface $\{ \mathbf{x}\in\mathbb{R}^n\mid g(\mathbf{x}) = 0\}$ (I use "surface", but whether it is a 2-dimensional, 1-dimensional, or whatever-dimensional object will depend on the $g$ and the $\mathbb{R}^n$ we are dealing with).

The gradient of $f$, $\nabla f$, points in the direction of greatest increase for $f$. If we want to find the largest value of $f$ along the surface $g=0$, then we need the direction of greatest increase to be orthogonal to the surface; otherwise, moving along the surface will "capture" some of that increase and $f$ will not achieve its maximum on $g=0$ at that point (this is akin to the fact that in one-variable calculus, the derivative should be $0$ at the maximum; otherwise, moving a bit in one direction will increase the value of the function).

In order for $\nabla f$ to be perpendicular to the surface, it must be parallel to the gradient of $g$ (which is normal to the level surface $g=0$); so $\nabla f$ must be a scalar multiple of $\nabla g$. This amounts to finding a solution to the system \begin{align*} \nabla f(\mathbf{x}) &= \lambda \nabla g(\mathbf{x})\\ g(\mathbf{x}) &= 0 \end{align*} for both $\mathbf{x}$ and $\lambda$.
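As a concrete illustration of solving this system (my own hypothetical example, not part of the original answer), here is a short SymPy sketch for $f(x,y)=xy$ on the unit circle $g(x,y)=x^2+y^2-1=0$:

```python
# A minimal sketch, assuming SymPy is available; the choice of f and g
# is illustrative. Solve grad f = lam * grad g together with g = 0.
import sympy as sp

x, y, lam = sp.symbols('x y lam', real=True)
f = x * y
g = x**2 + y**2 - 1

eqs = [
    sp.Eq(sp.diff(f, x), lam * sp.diff(g, x)),  # f_x = lam * g_x
    sp.Eq(sp.diff(f, y), lam * sp.diff(g, y)),  # f_y = lam * g_y
    sp.Eq(g, 0),                                # stay on the surface
]
for s in sp.solve(eqs, [x, y, lam], dict=True):
    print(s, '  f =', f.subs(s))  # compare f at each candidate
```

Of the four candidate points this produces, two give $f = 1/2$ (the constrained maxima) and two give $f = -1/2$ (the minima), which already illustrates that solving the system alone does not tell you which kind of point you have.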

Added. Such a point is not guaranteed to be a maximum or a minimum; it could also be a saddle point, or nothing at all, much as in the one-variable case, where points with $f'(x)=0$ are not guaranteed to be extrema of the function. Another obvious limitation is that if $g$ is not differentiable (does not have a well-defined gradient), then you cannot even set up the system.
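To make the limitations still more concrete, here is a hypothetical example (my addition, not from the original answer) where the method misses the optimum even though $g$ is perfectly smooth, because $\nabla g$ vanishes there: minimize $f(x,y) = x$ subject to $g(x,y) = x^3 - y^2 = 0$. The constraint forces $x = (y^2)^{1/3} \geq 0$, so the minimum sits at the origin, but $\nabla g(0,0) = (0,0)$ and the system has no solution at all:

```python
# A minimal sketch, assuming SymPy; the cuspidal constraint x^3 = y^2
# is an illustrative choice. The multiplier system is inconsistent.
import sympy as sp

x, y, lam = sp.symbols('x y lam', real=True)
f = x                    # minimize x ...
g = x**3 - y**2          # ... on the cusp curve x^3 = y^2

eqs = [
    sp.Eq(sp.diff(f, x), lam * sp.diff(g, x)),  # 1 = 3*lam*x**2
    sp.Eq(sp.diff(f, y), lam * sp.diff(g, y)),  # 0 = -2*lam*y
    sp.Eq(g, 0),
]
print(sp.solve(eqs, [x, y, lam], dict=True))    # [] -- no candidates,
# yet the constrained minimum is at (0, 0), where grad g vanishes.
```

So even with differentiable data, one needs $\nabla g \neq 0$ on the surface (a constraint qualification) for the method to be guaranteed to flag the optimum.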

Arturo Magidin
  • +1, although I would add, as a nod to the OP's request to "highlight its limitations," that not every solution to the system is guaranteed to be a maximum or minimum of $f$ on $g(\mathbf{x}) = 0$ (as with the single-variable case with the derivative being zero). – Mike Spivey Feb 27 '11 at 05:17
  • @Mike Good point. – Arturo Magidin Feb 27 '11 at 05:19
  • Very easy to understand, thanks. – John McVirgooo Feb 27 '11 at 05:20
  • An answer by @ArturoMagidin! I think I will upvote without reading it. And then I'll read it. – badatmath Nov 15 '12 at 00:33
  • Can you please elaborate on what you mean by "The gradient of $f$, $\nabla f$, points in the direction of greatest increase for $f$."? Do you mean that $\nabla f$ points in the direction of the greatest increase for $f$ at the point where $f$ is maximal? – M Smith Dec 02 '15 at 16:29
  • Nice answer, but I think you have an extra copy of the words "will increase" – J. W. Tanner Feb 03 '20 at 01:02
  • @J.W.Tanner: Yes, but I'm not going to bump an 11-year-old question to correct a bit of grammar that does not obscure the meaning nor misleads the reader. – Arturo Magidin Feb 03 '20 at 01:31
23

An algebraic way of looking at this is as follows:

From an algebraic viewpoint, we know how to find the extrema of a function of many variables without constraints: to find the extrema of $f(x_1,x_2,\ldots,x_n)$, we set the gradient to zero and look at the definiteness of the Hessian.
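For instance (my own illustration, not part of the original answer), the unconstrained recipe in SymPy:

```python
# A minimal sketch, assuming SymPy; f is an illustrative choice.
# Unconstrained extremum: set the gradient to zero, test the Hessian.
import sympy as sp

x, y = sp.symbols('x y', real=True)
f = x**2 + x*y + y**2

grad = [sp.diff(f, v) for v in (x, y)]
crit = sp.solve(grad, [x, y], dict=True)        # [{x: 0, y: 0}]

H = sp.hessian(f, (x, y))                       # constant matrix here
print(crit, H.subs(crit[0]).is_positive_definite)  # True: a minimum
```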

We would like to extend this idea to the case where we want to find the extremum of a function subject to some constraints. Say the problem is: $$\begin{align} &\text{Minimize } f(x_1,x_2,\ldots,x_n)\\ &\text{subject to: } g_k(x_1,x_2,\ldots,x_n) = 0\\ &\text{where } k \in \{1,2,\ldots,m\} \end{align}$$

If we find the extrema of $f$ just by setting the gradient of $f$ to zero, these extrema need not satisfy the constraints.

Hence, we would like to include the constraints in the previous idea. One way to do this is as follows. Define a new function: $$F(\vec{x},\vec{\lambda}) = f(\vec{x}) - \lambda_1 g_1(\vec{x}) - \lambda_2 g_2(\vec{x}) - \cdots - \lambda_m g_m(\vec{x})$$ where $\vec{x} = \left[ x_1,x_2,\ldots,x_n \right]$ and $\vec{\lambda} = \left[\lambda_1,\lambda_2,\ldots,\lambda_m \right]$.

Note that when the constraints are enforced, we have $F(\vec{x},\vec{\lambda}) = f(\vec{x})$, since each $g_j(\vec{x}) = 0$.

Let us find the extremum of $F(\vec{x},\vec{\lambda})$. This is done by setting $\frac{\partial F}{\partial x_i} = 0$ and $\frac{\partial F}{\partial \lambda_j} = 0$, where $i \in \{1,2,\ldots,n\}$ and $j \in \{1,2,\ldots,m\}$.

Setting $\frac{\partial F}{\partial x_i} = 0$ for all $i$ gives us $$\vec{\nabla}f = \sum_{j=1}^{m} \lambda_j \vec{\nabla}g_j(\vec{x}),$$ which can be written compactly as $\vec{\nabla}f = \vec{\nabla}g \cdot \vec{\lambda}$, where $\vec{\nabla}g = \left[\vec{\nabla} g_1(\vec{x}),\vec{\nabla} g_2(\vec{x}),\ldots,\vec{\nabla} g_m(\vec{x}) \right]$.

Setting $\frac{\partial F}{\partial \lambda_j} = 0$ gives us $$g_j(\vec{x}) = 0,$$ where $j \in \{1,2,\ldots,m\}$.

Hence, at an extremum of $F$, the constraints are automatically enforced. This means that the extrema of $F$ correspond to the extrema of $f$ with the constraints enforced.

To decide whether the point we obtain by solving the system is a minimum, a maximum, or a saddle point, we need a second-order test. Note that the full Hessian of $F$ in $(\vec{x},\vec{\lambda})$ is never positive or negative definite at such a point (when the constraint gradients are nonzero), so the appropriate test is the bordered Hessian: the definiteness of the Hessian of $F$ with respect to $\vec{x}$, restricted to directions tangent to the constraint surfaces.
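Here is a short SymPy sketch of this recipe (my own illustration with a hypothetical problem, not part of the original answer): minimize $f(x,y,z) = x^2+y^2+z^2$ subject to $g_1 = x+y+z-1 = 0$ and $g_2 = x-y = 0$.

```python
# A minimal sketch, assuming SymPy; the problem is an illustrative choice.
# Build F = f - lam1*g1 - lam2*g2 and set ALL its partials to zero.
import sympy as sp

x, y, z, l1, l2 = sp.symbols('x y z l1 l2', real=True)
f = x**2 + y**2 + z**2
g1 = x + y + z - 1
g2 = x - y

F = f - l1 * g1 - l2 * g2
unknowns = [x, y, z, l1, l2]
eqs = [sp.diff(F, v) for v in unknowns]   # dF/dx_i = 0 and dF/dlam_j = 0

print(sp.solve(eqs, unknowns, dict=True))
# -> x = y = z = 1/3 (with l1 = 2/3, l2 = 0): the constraints
#    g1 = 0 and g2 = 0 are recovered automatically.
```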

  • +1. The approach Sivaram describes here also leads to a notion of duality for nonlinear optimization problems and ultimately to the important Karush-Kuhn-Tucker conditions. – Mike Spivey Feb 27 '11 at 05:49
0

[This is a very heuristic explanation of Lagrange multipliers.]

Consider the optimization problem

$${ {\begin{aligned} &\, \text{minimize } \quad f(x) \\ &\, \text{subject to } \, \, \, \, H(x) = 0 \end{aligned}} }$$

where ${ f : \mathbb{R} ^n \longrightarrow \mathbb{R} }$ and ${ H : \mathbb{R} ^n \longrightarrow \mathbb{R} ^{\ell} }$ are smooth functions.

Suppose at every point in the feasible set ${ \lbrace x : H(x) = 0 \rbrace , }$ the gradients ${ \nabla H _1 (x), \ldots, \nabla H _{\ell} (x) }$ are linearly independent.
Hence by the implicit function theorem, near any point in the feasible set ${ \lbrace x : H(x) = 0 \rbrace , }$ there are ${ \ell }$ coordinates (of the feasible set) which can be expressed as smooth functions of the other ${ (n-\ell) }$ coordinates.

Suppose ${ x ^{\ast} }$ is a local minimizer of ${ f(x) }$ under the constraints ${ H(x) = 0 .}$

Consider perturbations ${ \Delta x }$ such that ${ x ^{\ast} + \Delta x }$ stays approximately in the feasible set.
Equivalently, consider perturbations ${ \Delta x }$ which lie in the tangent space to the feasible set ${ \lbrace x : H(x) = 0 \rbrace }$ at ${ x ^{\ast} . }$

For brevity, the tangent space to the feasible set ${ \mathscr{F} = \lbrace x : H(x) = 0 \rbrace }$ at ${ x ^{\ast} }$ is denoted ${ T _{x ^{\ast}} \mathscr{F} . }$
Note that intuitively ${ T _{x ^{\ast}} \mathscr{F} }$ is ${ (n-\ell) }$ dimensional.

Note that

  • ${ H(x ^{\ast} + \Delta x) \approx 0 }$ for all small ${ \Delta x }$ in ${ T _{x ^{\ast}} \mathscr{F} .}$
  • ${ f(x ^{\ast} + \Delta x) \approx f(x ^{\ast}) }$ for all small ${ \Delta x }$ in ${ T _{x ^{\ast}} \mathscr{F} }$ (to first order: since both ${ \Delta x }$ and ${ -\Delta x }$ are tangent directions at the local minimizer, the first-order change in ${ f }$ can be neither positive nor negative).

Hence

  • ${ \nabla H _1 (x ^{\ast}) ^T \Delta x = 0, \ldots, \nabla H _{\ell} (x ^{\ast}) ^T \Delta x = 0 }$ for all small ${ \Delta x }$ in ${ T _{x ^{\ast}} \mathscr{F} . }$
  • ${ \nabla f(x ^{\ast}) ^T \Delta x = 0 }$ for all small ${ \Delta x }$ in ${ T _{x ^{\ast}} \mathscr{F} . }$

Note that ${ T _{x ^{\ast}} \mathscr{F} }$ is ${ (n - \ell) }$ dimensional, and the ${ \ell }$ linearly independent gradients ${ \nabla H _1 (x ^{\ast}), \ldots, \nabla H _{\ell} (x ^{\ast}) }$ are normal to ${ T _{x ^{\ast}} \mathscr{F} . }$

Hence

  • ${ \nabla H _1 (x ^{\ast}), \ldots, \nabla H _{\ell} (x ^{\ast}) }$ form a basis of ${ (T _{x ^{\ast}} \mathscr{F}) ^{\perp} .}$
  • ${ \nabla f(x ^{\ast}) \in (T _{x ^{\ast}} \mathscr{F}) ^{\perp} . }$

Hence there exist unique ${ \lambda _1 ^{\ast}, \ldots, \lambda _{\ell} ^{\ast} \in \mathbb{R} }$ such that

$${ \nabla f(x ^{\ast}) = \sum _{i=1} ^{\ell} \lambda _i ^{\ast} \nabla H _i (x ^{\ast}) .}$$

The ${ \lambda _i ^{\ast} }$s are called Lagrange multipliers.

Note that the necessary conditions for ${ x ^{\ast} }$ being a local minimizer

$${ {\begin{cases} \, \nabla f(x ^{\ast}) = \sum _{i=1} ^{\ell} \lambda _i ^{\ast} \nabla H _i (x ^{\ast}), \\ \, H(x ^{\ast}) = 0 \end{cases}} }$$

can be re-expressed as:

The point ${ (x ^{\ast}, - \lambda ^{\ast}) }$ is a critical point of

$${ {\begin{aligned} &\, \mathcal{L} : \mathbb{R} ^n \times \mathbb{R} ^{\ell} \longrightarrow \mathbb{R}, \\ &\, \mathcal{L} (x, \lambda) := f(x) + \lambda ^T H(x). \end{aligned}} }$$

The function ${ \mathcal{L} }$ is called the Lagrangian.

Hence the critical points of the Lagrangian give the potential candidates for the local minimizers of the constrained optimization problem.
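As a numerical sanity check of the key geometric fact above (my own illustration, with NumPy and a hypothetical problem): at a known minimizer, ${ \nabla f(x ^{\ast}) }$ should lie in the span of the constraint gradients, and least squares recovers the multipliers. Take ${ f(x) = x_1 + x_2 + x_3 }$ on the unit sphere ${ H(x) = \lVert x \rVert ^2 - 1 = 0 , }$ whose minimizer is ${ x ^{\ast} = -(1,1,1)/\sqrt{3} . }$

```python
# A minimal numerical sketch, assuming NumPy; the problem is illustrative.
import numpy as np

x_star = -np.ones(3) / np.sqrt(3)   # known constrained minimizer

grad_f = np.ones(3)                 # gradient of x1 + x2 + x3
grad_H = 2 * x_star                 # gradient of |x|^2 - 1

# Solve grad_f = lam * grad_H in the least-squares sense
# (one column per constraint gradient; here ell = 1).
A = grad_H.reshape(-1, 1)
lam, *_ = np.linalg.lstsq(A, grad_f, rcond=None)

print('lambda* =', lam[0])                            # -sqrt(3)/2 = -0.866...
print('residual =', np.linalg.norm(A @ lam - grad_f)) # 0: grad f is in the span
```

Note the sign convention: this recovers ${ \lambda ^{\ast} }$ in the form ${ \nabla f(x ^{\ast}) = \sum _i \lambda _i ^{\ast} \nabla H _i (x ^{\ast}) }$ used above, which is why the critical point of ${ \mathcal{L}(x, \lambda) = f(x) + \lambda ^T H(x) }$ is ${ (x ^{\ast}, -\lambda ^{\ast}) . }$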