Resources
I have tried to give an answer where the definitions are chosen so that they are compatible with as many fields as I could think of. For example, one may choose to define the differential as not necessarily linear, but I have not taken this approach. Most of what I have written is based on the resources listed below, but I emphasize once again that the definitions in these resources may clash with other uses of "differential" in the literature.
I highly recommend looking at the presentation on "Differentiation in Linear Spaces" by Simovici - especially for the definition of differential. He also has examples in the second part of the presentation. For a discussion of terminology and definitions around Fréchet and Gateaux derivatives and variations I have found the appendix "Differentiation in Abstract Spaces" by Tapia to be very nice (jump to figure 7.1 for a visual summary). For counterexamples of Gateaux differentiable functions that are not Fréchet differentiable see the nice figures in the handout on calculus of variations by Slastikov and Kitavtsev. You can also look at "Introduction of Fréchet and Gateaux Derivative" by Bernardi and Enyari and the section on generalized derivatives in Jahn's book "Introduction to the Theory of Nonlinear Optimization".
Total/Fréchet derivative for $f:\mathbb{R}^n\to\mathbb{R}^m$
The (total) derivative of a function $f:\mathbb{R}^n\to\mathbb{R}^m$ at a point $x_0\in \mathbb{R}^n$ is defined as the (bounded) linear map $L$ such that
$$\lim_{v\to 0}\frac{\|f(x_0+v)-f(x_0)-L(v)\|}{\|v\|} = 0.$$
If there exists an $L$ satisfying the above, we say that $f$ is differentiable at $x_0$, and one can show that the derivative $L$ is unique (i.e. there is no $L'\ne L$ such that it also satisfies the above).
One typically uses the notation $D_{x_0}f, Df(x_0), d_{x_0}f, df(x_0), f'_{x_0}, f'(x_0)$ instead of $L$. The function-like notation is intentional as will become evident when I define the differential in a subsection below. The above definition of the total derivative is equivalent to there existing a linear map $df(x_0)$ such that
$$f(x_0+v) = f(x_0) + df(x_0)(v) + R(x_0,v), \quad \lim_{v\to 0}\frac{\|R(x_0,v)\|}{\|v\|} = 0.$$
That is, you can use it in the Taylor expansion.
Moreover, whenever the total derivative exists it agrees with the directional derivatives
$$Df(x_0)(v) = \partial_v f(x_0) := \lim_{\epsilon\to 0}\frac{f(x_0+\epsilon v)-f(x_0)}{\epsilon}.$$
Note that the directional derivatives existing does not guarantee that the total derivative exists.
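As a quick sanity check, here is a minimal numerical sketch of the agreement between the total derivative and the directional derivatives; the function $f$, the point $x_0$, and the direction $v$ are illustrative choices of mine, not from the references.

```python
import numpy as np

def f(x):
    """An illustrative smooth map f: R^2 -> R^2."""
    return np.array([x[0]**2 + x[1], np.sin(x[0] * x[1])])

def jacobian(x):
    """Hand-computed Jacobian of f at x (so Df(x)(v) = jacobian(x) @ v)."""
    return np.array([[2 * x[0], 1.0],
                     [x[1] * np.cos(x[0] * x[1]), x[0] * np.cos(x[0] * x[1])]])

x0 = np.array([1.0, 0.5])
v = np.array([0.3, -0.2])
eps = 1e-6

# directional derivative from the limit definition (finite difference in eps)
directional = (f(x0 + eps * v) - f(x0)) / eps
# the total derivative applied to v
linear_map = jacobian(x0) @ v
assert np.allclose(directional, linear_map, atol=1e-5)
```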
The differential
Suppose that the total derivative of $f:\mathbb{R}^n\to\mathbb{R}^m$ exists on a subset $X\subseteq\mathbb{R}^n$. Then you can define the differential (see slide 6 of Simovici's presentation for this definition) as the function $\delta f : X \times \mathbb{R}^n \to \mathbb{R}^m$ such that $\delta f(x_0; v) = df(x_0)(v)$. Note that the total derivative of $f$ at $x_0$ is a linear map $df(x_0):\mathbb{R}^n\to\mathbb{R}^m$, while the differential is a function of two arguments $\delta f : X \times \mathbb{R}^n \to \mathbb{R}^m$ that is linear in its second argument. For convenience one often reuses the notation of the derivative with an omitted point evaluation argument $df, Df, f'$ for the differential $\delta f$. Then $df = \delta f$, i.e., $df: X\times \mathbb{R}^n\to\mathbb{R}^m$. You can think of this as producing the derivative $df(x_0)$ from the differential $df$ through currying.
On a separate note - the notation $\delta f$ is often reserved for the Gateaux variation, so I would really use $df, Df, f'$ unlike Simovici's $\delta f$.
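The currying point of view can be made concrete with a small sketch; the quadratic $f(x) = x^\top x$ below, with $df(x_0)(v) = 2x_0\cdot v$, is an illustrative choice.

```python
import numpy as np

def differential(x0, v):
    """The two-argument differential of f(x) = x^T x: df(x0)(v) = 2 x0 . v."""
    return 2.0 * np.dot(x0, v)

def derivative_at(x0):
    """Currying: fix the point x0 and obtain the linear map df(x0)."""
    return lambda v: differential(x0, v)

df_at_x0 = derivative_at(np.array([1.0, 2.0]))
# df(x0)(v) = 2 * (1*3 + 2*4) = 22
assert np.isclose(df_at_x0(np.array([3.0, 4.0])), 22.0)
```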
Relation to the derivative for $f:\mathbb{R}\to\mathbb{R}$
In single variable calculus one defines the derivative at $x_0$ as
$$\frac{df}{dx}(x_0) := \lim_{v\to 0} \frac{f(x_0+v)-f(x_0)}{v}.$$
The above says that for any $\epsilon>0$ we can find a $\delta(\epsilon)>0$ such that for $|v|<\delta$ we have
$$|(f(x_0+v)-f(x_0))/v-\frac{df}{dx}(x_0)|<\epsilon.$$
But then this is also equivalent to
$$\lim_{v\to 0} \frac{|f(x_0+v)-f(x_0)-\frac{df}{dx}(x_0)v|}{|v|} = 0.$$
In other words $df(x_0)(v) = \frac{df}{dx}(x_0)\cdot v$. Note that $df(x_0)$ is the total derivative, which is a linear map, while what we call the derivative in basic calculus, i.e. $df/dx(x_0) = df(x_0)(1)$, is really the coordinate representation of $df(x_0):\mathbb{R}\to\mathbb{R}$, similar to how matrices are coordinate representations of linear maps (see my section on the Jacobian below). I suppose that one uses the term "derivative" for both the coordinate representation and the map because in $\mathbb{R}$ it doesn't matter too much, as you have a standard basis. It's similar to how one informally refers to a matrix $M\in\mathbb{R}^{m\times n}$ as a linear map, even though it's really $M\cdot :\mathbb{R}^n\to\mathbb{R}^m$ that is the linear map ($\cdot$ here being matrix-vector multiplication). If you go to abstract vector spaces where there is no canonical choice of basis, I would argue that it's better to reserve the term derivative for the linear map, as the Jacobian depends on the choice of basis, while the definition of $df(x_0)$ does not (you find an elaboration of this argument in the introduction of Tapia's "Differentiation in Abstract Spaces").
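A quick numerical sketch of the equivalence between the quotient form and the remainder form of the single-variable derivative; $f = \sin$ and $x_0$ are illustrative choices.

```python
import math

f = math.sin
x0 = 0.7
fprime = math.cos(x0)  # the classical derivative df/dx(x0)

for v in (1e-2, 1e-4, 1e-6):
    # |f(x0+v) - f(x0) - f'(x0) v| / |v| should shrink with v
    ratio = abs(f(x0 + v) - f(x0) - fprime * v) / abs(v)
    # for smooth f the remainder is O(v^2), so the ratio is O(v)
    assert ratio < abs(v)
```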
The Jacobian
In finite-dimensional spaces any linear map $L:U\to V$ between vector spaces $U$ and $V$ can be written as a matrix with respect to specific bases of $U$ and $V$. Suppose that $A=[a_1,\ldots,a_n] \in U^{1\times n}$ is a basis for $U$ and $B=[b_1,\ldots,b_m] \in V^{1\times m}$ is a basis for $V$. The dual basis of the continuous dual space $V^*$ corresponding to $B$ is $b^1,\ldots,b^m: V \to\mathbb{R}$, satisfying the biorthogonality condition $b^i(b_j) = \delta^i_j$. Then the coordinate representation of $L$ w.r.t. the two bases is given as
$$([L]^A_B)^i_j = b^i(L(a_j)) \implies [L]^A_B\in\mathbb{R}^{m\times n}.$$
This is just a matrix, but note that it depends on the choice of bases.
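A small sketch of this construction in coordinates; the map $L$ and the bases $A, B$ are illustrative choices. With the basis vectors as columns, the dual functionals $b^i$ are the rows of $B^{-1}$, so $[L]^A_B = B^{-1} L A$.

```python
import numpy as np

L = np.array([[1.0, 2.0],
              [0.0, 3.0]])   # L expressed in the standard bases
A = np.array([[1.0, 1.0],
              [0.0, 1.0]])   # columns are the basis a_1, a_2 of U
B = np.array([[2.0, 0.0],
              [1.0, 1.0]])   # columns are the basis b_1, b_2 of V

# ([L]^A_B)^i_j = b^i(L(a_j))  ->  rows of B^{-1} applied to the columns L @ A
L_AB = np.linalg.inv(B) @ L @ A

# sanity check: take coordinates wrt A through [L]^A_B, map back to the
# standard basis via B, and compare with applying L directly
u_coords = np.array([0.7, -0.4])
u = A @ u_coords
assert np.allclose(B @ (L_AB @ u_coords), L @ u)
```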
If you take $U=\mathbb{R}^n$ and $V=\mathbb{R}^m$ one usually chooses the standard basis $a_j = e_j$ and $b^i(w) = e^i(w) = e_i^T\cdot w = w^i$. So now suppose you have the derivative $df(x_0):\mathbb{R}^n\to\mathbb{R}^m$, then its coordinate representation w.r.t. the standard basis is
$$([df(x_0)]^A_B)^i_j = e^i(df(x_0)(e_j)) = e^i(\partial_{e_j} f(x_0)) = (\partial_{e_j} f(x_0))^i = \partial_{e_j} f^i(x_0).$$
This is the Jacobian matrix $Jf(x_0)=[df(x_0)]^A_B$ at $x_0$. To make this even clearer you can write the above as
\begin{align}
f &= \begin{bmatrix} f^1 \\ \vdots \\ f^m\end{bmatrix} : \mathbb{R}^n\to\mathbb{R}^m \\
Jf(x_0) &= \begin{bmatrix}
\partial_{e_1}f^1(x_0) & \ldots & \partial_{e_n} f^1(x_0) \\
\vdots & & \vdots \\
\partial_{e_1}f^m(x_0) & \ldots & \partial_{e_n} f^m(x_0)
\end{bmatrix} \implies df(x_0)(v) = Jf(x_0)\cdot v.
\end{align}
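The columns of the Jacobian are exactly the directional derivatives along the standard basis, which suggests a simple finite-difference sketch; the function $f$ and the point $x_0$ are illustrative choices.

```python
import numpy as np

def f(x):
    """An illustrative map f: R^2 -> R^3."""
    return np.array([x[0] * x[1], np.exp(x[0]), x[1]**2])

def numeric_jacobian(f, x0, eps=1e-6):
    """Build Jf(x0) column by column: column j approximates partial_{e_j} f(x0)."""
    cols = []
    for j in range(x0.size):
        e_j = np.zeros_like(x0)
        e_j[j] = 1.0
        cols.append((f(x0 + eps * e_j) - f(x0 - eps * e_j)) / (2 * eps))
    return np.column_stack(cols)

x0 = np.array([0.5, -1.0])
J = numeric_jacobian(f, x0)
exact = np.array([[x0[1], x0[0]],
                  [np.exp(x0[0]), 0.0],
                  [0.0, 2 * x0[1]]])
assert np.allclose(J, exact, atol=1e-6)
```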
You could also write $df(x_0) = \sum_{j=1}^n \partial_{e_j}f(x_0) e^j$ - remember that $e^j:\mathbb{R}^n\to\mathbb{R}$, so this is indeed a function from $\mathbb{R}^n$ to $\mathbb{R}^m$. This way of writing it makes it obvious that it is related to the exterior derivative from differential geometry, although there they use $dx^i$ for $e^i$ and $\frac{\partial f}{\partial x^j}(x_0)$ for $\partial_{e_j}f(x_0)$, that is
$$df(x_0) = Jf(x_0) \cdot dx = \sum_{j=1}^n \frac{\partial f}{\partial x^j}(x_0)\,dx^j \implies df = \sum_{j=1}^n \frac{\partial f}{\partial x^j}\,dx^j.$$
If your spaces were not $U=\mathbb{R}^n$ and $V=\mathbb{R}^m$, and there were no standard/canonical choice of basis, then it might not be obvious w.r.t. which bases the Jacobian is defined, so you might have to write $[df(x_0)]^A_B$ instead of just $Jf(x_0)$. When you see $Jf(x_0)$, it is defined w.r.t. some bases which should be clear from the context in which it appears. The point is that $df(x_0)$ is the primary notion, not $Jf(x_0)$; the latter is just a coordinate representation of the linear map. Of course, for computations you usually start by computing $Jf(x_0)$ in some basis and do not care that $df(x_0)$ is the primary notion.
Generalizations: Fréchet and Gateaux derivatives
You can take $f:X\to T$ where $(S,\|\cdot\|_S)$ and $(T,\|\cdot\|_T)$ are normed vector spaces (they can be infinite-dimensional too - then consider Banach spaces), and $X$ is an open subset of $S$. Then the Fréchet derivative $df(x_0): S \to T$ of $f$ at $x_0$ is the bounded linear map that satisfies
$$\lim_{v\to 0}\frac{\|f(x_0+v)-f(x_0)-df(x_0)(v)\|_T}{\|v\|_S} = 0.$$
Equivalently
$$f(x_0+v) = f(x_0) + df(x_0)(v) + R(x_0,v), \quad \lim_{v\to 0}\frac{\|R(x_0,v)\|_T}{\|v\|_S} = 0.$$
So the total derivative is the Fréchet derivative for $S=\mathbb{R}^n$ and $T=\mathbb{R}^m$ with the Euclidean norms $\|\cdot\|_S=\|\cdot\|_2$ and $\|\cdot\|_T=\|\cdot\|_2$.
There is also the Gateaux derivative at $x_0$, defined as the bounded linear map $df_G(x_0)$ such that for any $v\in S$
$$df_G(x_0)(v) = \lim_{\epsilon\to 0}\frac{f(x_0+\epsilon v)-f(x_0)}{\epsilon}.$$
Equivalently you can write
$$f(x_0+v) = f(x_0) + df_G(x_0)(v) + R(x_0,v), \quad \lim_{\epsilon\to 0}\frac{\|R(x_0,\epsilon v)\|_T}{\epsilon} = 0.$$
It's a weaker derivative than the Fréchet derivative in the sense that we care only about convergence along lines, while for the Fréchet derivative we require convergence along any path. This means that any Fréchet derivative is also a Gateaux derivative, but the converse is not necessarily true. You can find various examples illustrating the differences between the two in the figures in this handout on calculus of variations by Slastikov and Kitavtsev. Typically one does not write $df_G$ and rather just writes $df$, so whether $df$ refers to the Gateaux or Fréchet derivative should typically be deduced from the context.
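To make the gap concrete, here is a sketch of one classic counterexample (my own choice, not necessarily one of the handout's figures): $f(x,y) = x^3 y/(x^6+y^2)$ with $f(0,0)=0$. Along every line through the origin the difference quotient tends to $0$, so the Gateaux derivative at $0$ is the zero map, yet along the curve $y=x^3$ the function is constantly $1/2$, so $f$ is not even continuous at $0$ and hence not Fréchet differentiable there.

```python
def f(x, y):
    if x == 0.0 and y == 0.0:
        return 0.0
    return x**3 * y / (x**6 + y**2)

# along any line t -> t*(a, b), the quotient f(t a, t b)/t tends to 0
for a, b in [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, -3.0)]:
    t = 1e-4
    assert abs(f(t * a, t * b) / t) < 1e-3

# but along the curve y = x^3 the value stays at 1/2 however small x is,
# so convergence along arbitrary paths (Frechet) fails
for x in (1e-1, 1e-3, 1e-5):
    assert abs(f(x, x**3) - 0.5) < 1e-12
```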
The Gateaux derivative is quite useful for example in the calculus of variations. Note also that there are conflicting definitions of Gateaux differentiability in the literature, where for example linearity may not be required. Personally here I used the definitions from the presentation by Simovici "Differentiation in Linear Spaces" and Tapia's "Differentiation in Abstract Spaces". As in Tapia's treatment I would rather reserve the word variation for the setting where the "derivative" is not necessarily linear or bounded.
The gradient $\nabla f$ for $f:\mathbb{R}^n\to\mathbb{R}$
While the above definitions of the Fréchet and Gateaux derivatives required normed vector spaces (in fact, a topological vector space is sufficient for a Gateaux variation), the definition of the gradient requires an inner product. You can then define the (Gateaux) gradient $\nabla f(x_0)$ as the element that satisfies $df(x_0)(v) = \langle \nabla f(x_0), v\rangle$, where $\langle\cdot,\cdot\rangle$ is the inner product and $df(x_0)$ is the Gateaux derivative. You will notice a notational clash if you read the article on the (Gateaux) gradient in the encyclopedia of math and Tapia's treatment. I would prefer to stick to $\nabla f$ for gradients and reserve $f'$ for derivatives. In either case, the gradient technically depends on your choice of inner product. For $f:\mathbb{R}^n\to\mathbb{R}$ with respect to the standard dot product it simply becomes
$$\nabla f(x_0) = \begin{bmatrix} \partial_{e_1} f(x_0) \\ \vdots \\ \partial_{e_n} f(x_0) \end{bmatrix}.$$
But if I were to define an inner product such that the Gramian is $G_{ij} = \langle e_i, e_j\rangle$ (i.e. $\langle u,v\rangle = u^TGv$), then the gradient w.r.t. this inner product is given as
$$\nabla_G f(x_0) = G^{-1}\begin{bmatrix} \partial_{e_1} f(x_0) \\ \vdots \\ \partial_{e_n} f(x_0) \end{bmatrix} = G^{-1}(Jf(x_0))^{T}.$$
You can verify that with this definition $df(x_0)(v) = \langle \nabla_G f(x_0), v\rangle$.
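Here is a sketch of that verification in coordinates; the function $f$, the Gramian $G$, and the points are illustrative choices.

```python
import numpy as np

def f(x):
    """An illustrative scalar function on R^2."""
    return x[0]**2 + 3.0 * x[0] * x[1]

def grad_std(x):
    """Vector of partial derivatives (the gradient wrt the dot product)."""
    return np.array([2 * x[0] + 3 * x[1], 3 * x[0]])

G = np.array([[2.0, 0.5],
              [0.5, 1.0]])            # symmetric positive definite Gramian
x0 = np.array([1.0, -2.0])
v = np.array([0.4, 0.7])

grad_G = np.linalg.solve(G, grad_std(x0))  # G^{-1} (Jf(x0))^T
df_v = grad_std(x0) @ v                    # df(x0)(v) = sum_j partial_j f * v_j
# <grad_G f(x0), v>_G = grad_G^T G v should equal df(x0)(v)
assert np.isclose(grad_G @ G @ v, df_v)
```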
Note that the gradient is a vector, while the derivative is a linear map - this is often confused, and you can even find misleading answers on physics.stack and math.stack that conflate the two.
A more interesting example is the Gateaux gradient of the functional $E(f) = \frac{1}{2}\int \|\nabla f\|^2$ w.r.t. the standard inner product $\langle f, g\rangle = \int f g$. You can show that it is $\nabla E(f) = -\Delta f$, where $\Delta$ is the Laplacian (assuming the boundary terms from integration by parts vanish, e.g. for variations vanishing on the boundary).
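A sketch of this computation, assuming the variation $h$ vanishes on the boundary so the boundary term from integration by parts drops out: expanding
$$E(f+\epsilon h) = \frac{1}{2}\int \|\nabla f + \epsilon \nabla h\|^2 = E(f) + \epsilon \int \nabla f\cdot\nabla h + \frac{\epsilon^2}{2}\int \|\nabla h\|^2,$$
so
$$dE(f)(h) = \lim_{\epsilon\to 0}\frac{E(f+\epsilon h)-E(f)}{\epsilon} = \int \nabla f\cdot\nabla h = -\int (\Delta f)\, h = \langle -\Delta f, h\rangle.$$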
The gradients $\nabla f^i$ for $f:\mathbb{R}^n\to\mathbb{R}^m$
If you have a vector-valued function $f:\mathbb{R}^n\to\mathbb{R}^m$ you could define gradients for each component:
\begin{equation}
f = \begin{bmatrix} f^1 \\ \vdots \\ f^m\end{bmatrix} \implies
df^i(x_0) (v) = \langle \nabla_G f^i(x_0), v\rangle \implies
\nabla_G f^i = G^{-1}\begin{bmatrix} \partial_{e_1} f^i(x_0) \\ \vdots \\ \partial_{e_n} f^i(x_0) \end{bmatrix}.
\end{equation}
Relation to exterior derivative and differential forms
Let $f:\mathbb{R}^n\to\mathbb{R}$, then the exterior derivative of $f$ is
$df = \sum_{i=1}^n \partial_{e_i} f\, e^i$. This is precisely the differential $df$ of $f$, which happens to be a one-form (field of linear functionals). In differential geometry one often writes $dx^i$ for the $e^i$ and $\frac{\partial f}{\partial x^i}$ for $\partial_{e_i} f$. This is the case because $f$ is typically defined over some manifold $M$ and you have coordinate functions $x^i:U\subseteq M\to\mathbb{R}$. Then $\frac{\partial}{\partial x^j}|_{p}$ form a basis for the tangent space $T_pM$ of $M$ at $p$, and $dx^i$ is the canonical dual basis for the dual space $(T_pM)^*$ such that $dx^i(\frac{\partial}{\partial x^j}|_{p}) = \delta^i_j$. In exterior calculus the exterior derivative is also defined for higher-order forms, however. That is, one may consider fields of antisymmetric $k$-linear maps, i.e. differential $k$-forms, and define the exterior derivative to produce a $(k+1)$-form.
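The "field of linear functionals" reading can be sketched numerically: at each point $x_0$, $df(x_0)$ eats a tangent vector $v$ and returns $\sum_i \frac{\partial f}{\partial x^i}(x_0)\,dx^i(v)$ with $dx^i(v) = v^i$. The function $f$ below is an illustrative choice.

```python
import numpy as np

def f(x):
    """An illustrative scalar function on R^2."""
    return x[0]**2 * x[1] + np.sin(x[1])

def df(x0):
    """The one-form df at x0: a linear functional on tangent vectors."""
    partials = np.array([2 * x0[0] * x0[1], x0[0]**2 + np.cos(x0[1])])
    return lambda v: partials @ v   # sum_i (df/dx^i)(x0) * dx^i(v)

x0 = np.array([1.0, 2.0])
v = np.array([0.1, -0.2])
eps = 1e-6
# central-difference directional derivative as a check
fd = (f(x0 + eps * v) - f(x0 - eps * v)) / (2 * eps)
assert np.isclose(df(x0)(v), fd, atol=1e-8)
```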