What are the hidden assumptions in this interpretation?
There are no hidden assumptions. Here is how to do this calculation: if $A$ is an invertible matrix, then for $\varepsilon$ a sufficiently small matrix (with respect to any matrix norm), the perturbation $A + \varepsilon$ remains invertible, and
$$\begin{align*} (A + \varepsilon)^{-1} &= ((I + \varepsilon A^{-1}) A)^{-1} \\
&= A^{-1} (I + \varepsilon A^{-1})^{-1} \\
&= A^{-1} (I - \varepsilon A^{-1} + O(\varepsilon^2)) \\
&= A^{-1} - A^{-1} \varepsilon A^{-1} + O(\varepsilon^2) \end{align*} $$
so we get that, as desired, the Frechet derivative is the linear map $\varepsilon \mapsto - A^{-1} \varepsilon A^{-1}$. This argument is valid with respect to any matrix norm. (Strictly speaking we need a small argument involving the convergence of a geometric series to show that the $O(\varepsilon^2)$ term is justified, but this is a standard von Neumann series argument.)
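If a concrete check is helpful, here is a small numerical sketch (in Python/NumPy; the random matrix, seed, and step sizes are just illustrative choices, not anything from the original post). It compares $(A + tE)^{-1}$ against the first-order approximation $A^{-1} - A^{-1}(tE)A^{-1}$ as $t \to 0$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n)) + n * np.eye(n)   # shifted so A is comfortably invertible
E = rng.standard_normal((n, n))                   # fixed perturbation direction
A_inv = np.linalg.inv(A)

for t in [1e-1, 1e-2, 1e-3, 1e-4]:
    eps = t * E
    exact = np.linalg.inv(A + eps)
    linear = A_inv - A_inv @ eps @ A_inv          # A^{-1} - A^{-1} eps A^{-1}
    err = np.linalg.norm(exact - linear)
    print(f"t = {t:.0e}   error = {err:.3e}")     # should shrink roughly like t^2
```

The printed errors should drop by roughly a factor of $100$ each time $t$ drops by a factor of $10$, which is exactly the $O(\varepsilon^2)$ remainder above.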
More generally, the Frechet derivative of a map between finite-dimensional vector spaces (including spaces of matrices) can be computed as a linear approximation with respect to any choice of norms, and because all norms on a finite-dimensional vector space are equivalent, it does not depend on that choice.
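To spell out where the norm enters: $Df(A)$ is by definition the (necessarily unique) linear map $L$ such that
$$\lim_{\|\varepsilon\| \to 0} \frac{\|f(A + \varepsilon) - f(A) - L(\varepsilon)\|}{\|\varepsilon\|} = 0,$$
and since any two norms on a finite-dimensional vector space are equivalent, whether this limit is $0$ does not depend on which norms we choose on the domain and codomain.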
It is assumed that the linear approximation uses the Frobenius/standard inner product (trace). Is this the only possible interpretation? How can I derive it myself?
I don't see where the Frobenius inner product is used in the linked post to define what a linear approximation is. It is used to write down linear maps: e.g. if one wants to differentiate a scalar-valued matrix function $f : M_n \to \mathbb{R}$, the result is a linear functional on $M_n$, and any such linear functional can be written as $\varepsilon \mapsto \text{tr}(A \varepsilon)$ for a unique matrix $A$, so it's convenient to describe such derivatives by identifying them with the corresponding matrices $A$. I'm under the impression that this is a standard convention in many places and is used without comment. (There's also a further question of whether we should take the transpose of that matrix or not, but this gets into annoying issues like the difference between a derivative and a gradient.)
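As a small worked example of that convention (my example, not one from the linked post): take $f(X) = \text{tr}(X^2)$ on $M_n$. Then
$$f(X + \varepsilon) = \text{tr}(X^2) + 2\,\text{tr}(X \varepsilon) + \text{tr}(\varepsilon^2),$$
using $\text{tr}(X \varepsilon) = \text{tr}(\varepsilon X)$, so the Frechet derivative at $X$ is the linear functional $\varepsilon \mapsto 2\,\text{tr}(X \varepsilon)$, which this convention records as the matrix $2X$.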
In the questions above, we may define the (unspecified) domain of $f$ to be any vector space that contains the matrix $A$ (symmetric matrices, kernels, ...), and we also have many options for the image vector space $W$.
None of those choices affect the Frechet derivative if it's calculated correctly. The paper you link is disturbing but it also clearly explains how carefully applying the definition of the Frechet derivative solves everything. The confusion is, among other things, about gradients, which involve a choice of inner product and which depend on that choice, and also about how these choices interact with passing to subspaces such as symmetric matrices.
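Here is a minimal illustration of that dependence (again my example, not the paper's): take $f(X) = \text{tr}(AX)$ on $M_n$ with the Frobenius inner product $\langle U, V \rangle = \text{tr}(U^T V)$. The gradient on all of $M_n$ is $A^T$, since $\text{tr}(A \varepsilon) = \langle A^T, \varepsilon \rangle$ for every $\varepsilon$. But if we restrict $f$ to the subspace of symmetric matrices, the gradient of the restriction is the unique symmetric $G$ with $\langle G, \varepsilon \rangle = \text{tr}(A \varepsilon)$ for all symmetric $\varepsilon$, namely
$$G = \frac{A + A^T}{2}.$$
The Frechet derivative $\varepsilon \mapsto \text{tr}(A \varepsilon)$ hasn't changed; only its representation as a gradient has.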
More trivially, in the books they write that the (Frechet) derivative of $2x$ is $2$, but such a derivative must be a linear function. They of course mean that the derivative of $f : V \to W$ that maps $x$ to $2x$ is the linear function $f' : V \to W$ that maps every $x$ to $2$, where $V = W = \mathbb{R}$. Can't we define a different subset $V$ of $\mathbb{R}$ with a different product operator (still a norm) that would yield another answer? Say, binary numbers and operators?
The function which maps $x$ to $2$ is not linear. Every linear function is its own Frechet derivative; "$2$" is shorthand for the linear map $x \mapsto 2x$. This calculation is valid for the map $x \mapsto 2x$ on any normed vector space. I don't know what you mean by "different product operator."
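Concretely, for $f(x) = 2x$ on any normed vector space $V$ we have
$$f(x + \varepsilon) - f(x) = 2\varepsilon$$
exactly, with zero remainder, so the Frechet derivative at every $x$ is the linear map $\varepsilon \mapsto 2\varepsilon$, regardless of the choice of norm on $V$.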