Given a scalar function $f$ of several variables $x_1,\dots,x_n$, the differential is defined by
$$\mathrm df=\frac{\partial f}{\partial x_1}\mathrm dx_1+\dots+\frac{\partial f}{\partial x_n}\mathrm dx_n=\sum_i\frac{\partial f}{\partial x_i}\mathrm dx_i$$
This is a linear form in disguise, that can also be written
$$\mathrm df(h)=\sum_i a_ih_i$$
With $a_i=\dfrac{\partial f}{\partial x_i}$.
Now, the differential of a scalar function $f$ of the matrix $X$ (with dimensions $n\times p$) is just a linear form, in the variables $x_{ij}$. You can write
$$\mathrm df=\sum_{ij}\frac{\partial f}{\partial x_{ij}}\mathrm dx_{ij}$$
Or
$$\mathrm df(h)=\sum_{ij}\frac{\partial f}{\partial x_{ij}}h_{ij}$$
Since it's a linear form, it's possible to write $\mathrm df$ as a scalar product of two column vectors:
$$\left(\begin{matrix}
\dfrac{\partial f}{\partial x_{11}}\\
\vdots\\
\dfrac{\partial f}{\partial x_{1p}}\\
\dfrac{\partial f}{\partial x_{21}}\\
\vdots\\
\dfrac{\partial f}{\partial x_{2p}}\\
\vdots\\
\dfrac{\partial f}{\partial x_{np}}\\
\end{matrix}\right)
\;\;\;\mathrm{ and }\;\;\;
\left(\begin{matrix}
h_{11}\\
\vdots\\
h_{1p}\\
h_{21}\\
\vdots\\
h_{2p}\\
\vdots\\
h_{np}
\end{matrix}\right)
$$
There is a more compact way.
First, notice that given two matrices $X,Y$ with same dimensions $n\times p$,
$$\mathrm{tr}(X^TY)=\sum_{ij}x_{ij}y_{ij}$$
You can prove that by writing the general term $a_{ij}$ of the product, which has dimensions $p\times p$:
$$a_{ij}=\sum_{k=1}^n x_{ki}y_{kj}$$
Hence
$$a_{ii}=\sum_{k=1}^n x_{ki}y_{ki}=\sum_{k=1}^n x_{ki}y_{ki}$$
$$\sum_{j=1}^p a_{jj}=\sum_{j=1}^p\sum_{i=1}^n x_{ij}y_{ij}$$
That is, $\mathrm{tr}(X^TY)$ is the sum of term-by-term products of all elements of $X$ and $Y$. That's very similar to a scalar product.
Now we can write $\mathrm{d}f$ in the more compact form
$$\mathrm{d}f(H)=\mathrm{tr}(A^TH)$$
With
$$A=\left(\begin{matrix}
\dfrac{\partial f}{\partial x_{11}}&\cdots&\dfrac{\partial f}{\partial x_{1p}}\\
\vdots&\ddots&\vdots\\
\dfrac{\partial f}{\partial x_{n1}}&\cdots&\dfrac{\partial f}{\partial
x_{np}}
\end{matrix}\right)$$
The differential of a scalar function of a matrix can always be written in this compact form, and $A$ is unique. By convention it's this $A$ we will call $\dfrac{\mathrm df}{\mathrm dX}$ (same convention as in the Matrix Cookbook). There is another convention, where the matrix derivative is instead $A^T$.
To find $A$, you can compute the partial derivatives, or use the Taylor formula for a function of several variables, at order $1$:
$$f(X+H)=f(X)+\mathrm df(H)+o(||H||)$$
Note it's a little-$o$, not a big-$O$ ($\mathrm df(H)$ is already a $O(||H||)$, so that would be meaningless with a big-$O$)), and it's the norm of $H$ inside.
Let's apply this to $f(X)=\mathrm{tr}(X^TX)$. Using the partial derivatives, and noticing that $f(X)=\sum_{ij} x_{ij}^2$, you get at once
$$\dfrac{\mathrm df}{\mathrm dX}=2X$$
Since $\mathrm{tr}(X^TX)=\mathrm{tr}(XX^T)$, the derivative is the same (it's the same function):
$$\dfrac{\mathrm d\left(\mathrm{tr}(XX^T)\right)}{\mathrm dX}=2X$$
Or with Taylor's formula
$$f(X+H)=\mathrm{tr}\left((X+H)^T(X+H)\right)=f(X)+\mathrm{tr}(X^TH)+\mathrm{tr}(XH^T)+o(||H||)$$
But $\mathrm{tr}(XH^T)=\mathrm{tr}(X^TH)$, so
$$f(X+H)=f(X)+\mathrm{tr}(2X^TH)+o(||H||)$$
And you identify the formula $\mathrm{tr}(A^TX)$, which gives you the derivative $A$. Or you can leave this as a differential
$$\mathrm df(H)=\mathrm{tr}(2X^TH)$$
A few other examples
The trace of the square of $X$ is given by:
$$\mathrm{tr}(X^2)=\sum_i x_{ii}^2+2\sum_{i<j}x_{ij}x_{ji}$$
By computing the partial derivatives or by using Taylor's formula, you find
$$\frac{\mathrm d\left(\mathrm{tr}(X^2)\right)}{\mathrm dX}=2X^T$$
The function $f(X)=\mathrm{tr}(A^TX)$ has derivative
$$\frac{\mathrm d\left(\mathrm{tr}(A^TX)\right)}{\mathrm dX}=A$$
By using an expansion of the determinant of $X$, you can prove that
$$\frac{\mathrm d(\det X)}{\mathrm dX}=\mathrm{Com}\,X$$
Where $\mathrm{Com}\, X$ is the comatrix of $X$.
See also the Wikipedia article on matrix calculus.