The idea is to show that you can find a basis consisting of vectors that are eigenvectors of both $A$ and $B$. The proof goes by induction on the dimension of the space (or the size of the matrices, if you prefer). The key observation is the following.
Let $V$ be the whole space ($\Bbb{C}^n$ or $\Bbb{R}^n$, depending).
Let $\lambda$ be an eigenvalue of $A$. Consider the corresponding eigenspace $V_\lambda$. Then it follows that $B(V_\lambda)\subseteq V_\lambda$. This is because for all $x\in V_\lambda$ we have
$$
A(Bx)=(AB)x=(BA)x=B(Ax)=B(\lambda x)=\lambda (Bx)
$$
proving that $Bx\in V_\lambda$.
This holds for all eigenvalues of $A$. If there is more than one eigenspace, then they all have dimension $<\dim V$, and the induction hypothesis kicks in: by the above observation it is enough to settle the question on each of those smaller spaces, because by diagonalizability of $A$ the whole space is a direct sum of the $V_\lambda$.
On the other hand, if one of the $V_\lambda$ is the whole space, then $A$ is a scalar matrix, and is thus diagonalized by any invertible matrix $S$. In that case it suffices to simply diagonalize $B$.
The base case of $1\times 1$ matrices is trivial.
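A concrete instance may help; the matrices below are picked purely for illustration and are not part of the argument. Take
$$
A=\begin{pmatrix}1&0&0\\0&1&0\\0&0&2\end{pmatrix},\qquad
B=\begin{pmatrix}0&1&0\\1&0&0\\0&0&5\end{pmatrix},
$$
which commute, since both are block diagonal with blocks of sizes $2$ and $1$ and the $2\times 2$ block of $A$ is scalar. The eigenspaces of $A$ are $V_1=\operatorname{span}(e_1,e_2)$ and $V_2=\operatorname{span}(e_3)$, and $B$ maps each into itself: $Be_1=e_2$, $Be_2=e_1$, $Be_3=5e_3$. On $V_1$ the matrix $A$ acts as the scalar $1$, so there it suffices to diagonalize the restriction of $B$, whose eigenvectors are $e_1+e_2$ (eigenvalue $1$) and $e_1-e_2$ (eigenvalue $-1$). With
$$
S=\begin{pmatrix}1&1&0\\1&-1&0\\0&0&1\end{pmatrix}
$$
we get $S^{-1}AS=\operatorname{diag}(1,1,2)$ and $S^{-1}BS=\operatorname{diag}(1,-1,5)$.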
[Edit]
What seems to be missing from the above is that the subspace $V_\lambda$ also has a basis consisting of eigenvectors of $B$. This can be shown as follows. Diagonalizability of $A$ means that
$$
V=V_\lambda\oplus\left(\bigoplus_{\mu\neq\lambda}V_\mu\right)
$$
is a sum of eigenspaces of $A$. Call that other summand $V_{\neq\lambda}$. Both $V_\lambda$ and $V_{\neq\lambda}$ are stable under $B$, because the above argument also shows that $B(V_\mu)\subseteq V_\mu$ for all $\mu$. If $\beta$ is any eigenvalue of $B$, and $U_\beta$ is the corresponding eigenspace, then any vector $y\in U_\beta$ can be uniquely written in the form $y=y_1+y_2$ with $y_1\in V_\lambda$, $y_2\in V_{\neq\lambda}$. Here $By=\beta y=(\beta y_1)+(\beta y_2)$. On the other hand, $By=By_1+By_2$ by linearity, with $By_1\in V_\lambda$ and $By_2\in V_{\neq\lambda}$. Both are decompositions of $By$ along $V_\lambda\oplus V_{\neq\lambda}$, so by the direct sum property we can conclude that $By_1=\beta y_1$ and $By_2=\beta y_2$. Therefore
$$
U_\beta=(U_\beta\cap V_\lambda)\oplus (U_\beta\cap V_{\neq\lambda}).
$$
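The claim follows from this: since $B$ is diagonalizable, $V$ is the direct sum of the $U_\beta$, so every $x\in V_\lambda$ is a sum of vectors $x_\beta\in U_\beta$, and taking $V_\lambda$-components shows that $x$ is a sum of vectors in the spaces $U_\beta\cap V_\lambda$. Hence $V_\lambda$ is spanned by eigenvectors of $B$.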
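To see the decomposition in a small case (the matrices are chosen only for illustration), take $A=\operatorname{diag}(1,1,2)$, $B=\operatorname{diag}(3,4,3)$ and $\lambda=1$, so that $V_\lambda=\operatorname{span}(e_1,e_2)$ and $V_{\neq\lambda}=\operatorname{span}(e_3)$. For $\beta=3$ we have $U_\beta=\operatorname{span}(e_1,e_3)$. The vector $y=e_1+e_3\in U_\beta$ lies in neither summand, but its components $y_1=e_1$ and $y_2=e_3$ are both eigenvectors of $B$ for the eigenvalue $3$, and
$$
U_3=\operatorname{span}(e_1)\oplus\operatorname{span}(e_3)=(U_3\cap V_1)\oplus(U_3\cap V_{\neq 1}).
$$
Together with $U_4\cap V_1=\operatorname{span}(e_2)$ this gives the basis $e_1,e_2$ of $V_1$ consisting of eigenvectors of $B$.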
[\Edit].