3

Given $u(x,v)$, $v(x,y)$, and $f(u,v)$ ($u$ is a function of $x$ and $v$, and $v$ itself is a function of $x$ and $y$), we want to find ${\partial f}/{\partial x}.$ I've seen this done as:

${\partial f}/{\partial x} = {\partial f}/{\partial u} \cdot {\partial u}/{\partial x} + {\partial f}/{\partial v} \cdot {\partial v}/{\partial x}$.

${\partial u}/{\partial x}$ is then found as if $v$ were a constant (not a function of $x$).

However, this seems wrong to me. The chain rule allows us to separate variables, but how does it allow us to treat functions of the variable we are differentiating with respect to as constants?

Nonetheless, this is a common approach in many CS papers, especially in machine learning and neural networks. Backpropagation, a common ML/NN algorithm, seems to rely on it. For a very clear example, see the derivation of ${\partial l}/{\partial x_i}$ in http://costapt.github.io/2016/07/09/batch-norm-alt/ .

What is the proof or basis to treat $v$ as a constant when taking the partial derivative with respect to $u$?

SRobertJames
  • I am always confused by the same thing in physics with the Lagrangian. They take a partial derivative with respect to $x$ and another partial derivative with respect to $\dot{x}$. – Bob Krueger Jul 14 '17 at 12:49

2 Answers


The approach above is common in applications, due to its convenience, but rare in mathematics, due to its sloppiness. Specifically, $f$ is poorly defined: Is it a function $\mathbb{R} \times \mathbb{R} \to \mathbb{R}$? If so, what does it mean to constrain the second parameter as a function of the first?

What we really want is: Given $f(x,y), a(x), b(x)$, define $g: \mathbb{R} \to \mathbb{R},\ x \mapsto f(a(x), b(x))$.

This proper notation highlights that, in the sloppy original notation, ${\partial f}/{\partial q}$ could have two meanings (!), and you had to figure out from context which was meant: either the slope of $f$ if only the first param changes, and we could somehow hold the second param constant; or the slope of $f$ if the first param changes and the second param changes accordingly. Our new notation fixes that, because $f$ is an arity-2 function and $g$ is an arity-1 function, and both can take any args in their domain (as a function should!).
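To see that the two meanings really are different numbers, here is a minimal numerical sketch (the functions $f$ and $h$ below are made up purely for illustration; $h$ plays the role of the hidden dependence of the second parameter on the first):

```python
# Minimal numerical sketch; f and h are made up purely for illustration.
# f is an arity-2 function; suppose its second argument "happens to be"
# a function h of the first, as in the sloppy notation.

def f(p, q):
    return p * q + q ** 2

def h(p):                       # the hidden dependence of the 2nd argument on the 1st
    return p ** 3

def num_diff(fn, x, eps=1e-6):
    """Central finite-difference approximation of fn'(x)."""
    return (fn(x + eps) - fn(x - eps)) / (2 * eps)

p0 = 2.0
q0 = h(p0)

# Meaning 1: slope of f when only the first argument changes, second held fixed.
meaning1 = num_diff(lambda p: f(p, q0), p0)     # = q0 = 8

# Meaning 2: slope of f when the first argument changes and the second
# "changes accordingly" through h.
meaning2 = num_diff(lambda p: f(p, h(p)), p0)   # = q0 + (p0 + 2*q0) * h'(p0) = 224

print(meaning1, meaning2)                       # ~8.0 vs ~224.0
```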

Continuing the notational upgrade, we use Spivak's $D$ operator: if $f: \mathbb{R}^m \to \mathbb{R}^n$, then $Df$ assigns to each point of $\mathbb{R}^m$ a linear map from $\mathbb{R}^m$ to $\mathbb{R}^n$, i.e. an $n \times m$ Jacobian matrix. (In this notation, partial derivatives are simply single entries of the derivative, and the gradient, Jacobian, and ordinary derivative are all the same thing, with the name changing when $n$ or $m$ equals 1.) So we can write a single function $c: \mathbb{R} \to \mathbb{R}^2,\ x \mapsto (a(x), b(x))$, and use the multivariable chain rule $Dg(x) = D(f \circ c)(x) = Df(c(x)) \cdot Dc(x)$.

Multiply the matrix product out (here it is just a dot product of a $1 \times 2$ row with a $2 \times 1$ column) and convert back to the original (sloppy, applied) notation, and you get $df/dx = {\partial f}/{\partial u} \cdot {\partial u}/{\partial x} + {\partial f}/{\partial v} \cdot {\partial v}/{\partial x}$. (Note that, here too, on the left side we are treating $f$ as a function of the single variable $x$, and on the right side as a function of the two variables $u$ and $v$ -- you have to tell from context which one is meant!)
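As a quick sanity check of this expansion, here is a sketch with invented $f$, $a$, $b$ (none of this is taken from the sources linked in the question; it just verifies the formula numerically):

```python
import math

def f(u, v):                     # arity-2 function (made up for the sketch)
    return u * v + math.sin(v)

def a(x):                        # u = a(x)
    return x ** 2

def b(x):                        # v = b(x)
    return 3 * x

def g(x):                        # arity-1 composite g = f ∘ c, with c(x) = (a(x), b(x))
    return f(a(x), b(x))

def num_diff(fn, x, eps=1e-6):
    """Central finite-difference approximation of fn'(x)."""
    return (fn(x + eps) - fn(x - eps)) / (2 * eps)

x0 = 1.5
u0, v0 = a(x0), b(x0)

# Entries of Df(c(x0)): slot partials of f, one argument varied at a time.
df_du = num_diff(lambda u: f(u, v0), u0)
df_dv = num_diff(lambda v: f(u0, v), v0)

# Entries of Dc(x0): ordinary derivatives of a and b.
da_dx = num_diff(a, x0)
db_dx = num_diff(b, x0)

# Chain rule Dg(x0) = Df(c(x0)) · Dc(x0), multiplied out as the familiar sum:
chain  = df_du * da_dx + df_dv * db_dx
direct = num_diff(g, x0)

print(chain, direct)             # the two numbers agree up to finite-difference error
```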

In short, the problem arises because, unlike Spivak notation, which only defines the derivative with respect to the parameters of the function, Leibniz notation allows you to take the derivative with respect to anything, and, if there are multiple anythings in the same expression, it's unstated how those anythings are interrelated. On the left side, $dx$ means "assuming $x$ changes and causes $u$ and $v$ to change", whereas on the right side, ${\partial u}$ means "assuming $u$ changes with $v$ not changing". The problem would be even more glaring if the question had been posed as "What is the total derivative of $f(x, u(x))$ w.r.t. $x$?", since the answer is then $df/dx = {\partial f}/{\partial x} + {\partial f}/{\partial u} \cdot du/dx$ (!!!).
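Here is the same kind of sketch for the $f(x, u(x))$ form, again with made-up functions, showing that ${\partial f}/{\partial x}$ and $df/dx$ are genuinely different numbers while the total-derivative formula still checks out:

```python
def f(x, u):                     # explicit dependence on x AND on u (made up)
    return x ** 2 * u

def u_of(x):                     # u(x)
    return x ** 3

def num_diff(fn, x, eps=1e-6):
    """Central finite-difference approximation of fn'(x)."""
    return (fn(x + eps) - fn(x - eps)) / (2 * eps)

x0 = 2.0
u0 = u_of(x0)

partial_f_x = num_diff(lambda x: f(x, u0), x0)      # ∂f/∂x, u held fixed: 2*x0*u0 = 32
partial_f_u = num_diff(lambda u: f(x0, u), u0)      # ∂f/∂u: x0**2 = 4
du_dx       = num_diff(u_of, x0)                    # 3*x0**2 = 12

total  = partial_f_x + partial_f_u * du_dx          # 32 + 4*12 = 80
direct = num_diff(lambda x: f(x, u_of(x)), x0)      # d/dx of x**5 at x0: 5*x0**4 = 80

print(partial_f_x, total, direct)                   # ∂f/∂x ≠ df/dx, but the total-derivative
                                                    # formula matches the direct derivative
```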

SRobertJames

Total derivative is the name for doing exactly that (see https://en.wikipedia.org/wiki/Total_derivative ). It's explored at length in "What exactly is the difference between a derivative and a total derivative?" and at https://spin0r.wordpress.com/2013/01/04/the-difference-between-partial-and-total-derivatives/ .

See also https://www.khanacademy.org/math/multivariable-calculus/multivariable-derivatives/differentiating-vector-valued-functions/a/multivariable-chain-rule-simple-version

I'm shocked, however, that it seems to be missing from standard math texts. E.g. the above Wikipedia article cites Fundamental Methods of Mathematical Economics for the Total Derivative, not any standard calc text. Likewise, https://ocw.mit.edu/courses/mathematics/18-02-multivariable-calculus-fall-2007/readings/non_ind_variable.pdf touches on the problems here, but does not introduce the concept of the Total Derivative.

I'd like to see a few things:

  1. A proof showing that the Total Derivative can be found this way
  2. Some type of analytical discussion of the Total Derivative (similar to how ordinary and partial derivatives are defined and explored)
  3. The conditions required for the Total Derivative to be defined
  4. A reference to it in any math text (not an economics, thermodynamics, or CS text), or even a paper
SRobertJames