
In "Efficient Backprop" (http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf), LeCun and others propose a modified tanh activation function of the form:

$$ f(x) = 1.7159 \tanh\left(\frac{2}{3} x\right) $$

They argue that:

  • It is easier to approximate with polynomials
  • It is designed so that its second derivative is maximal at $x = 1$

I tried to start from a function of the form $f(x) = a \tanh(bx)$ and derive the values of $a$ and $b$ that match the aforementioned properties.

Any idea how those constants are derived? Under what assumptions? Does the function match its expected properties by construction?

Brian Spiering
Lucas Morin

2 Answers


In "Generalization and Network Design Strategies", LeCun argues that he chose parameters that satisfy: $$ f (\pm1) = \pm1$$

The rationale behind this is that the overall gain of the squashing transformation is around 1 in normal operating conditions, and the interpretation of the state of the network is simplified. Moreover, the absolute value of the second derivative of $f$ is a maximum at $+1$ and $-1$, which improves the convergence at the end of the learning session.

Visualization of the derivatives using this code:

[Figure: Behavior of Tanh_lecun]

From his description of the maxima of the second derivative, I conclude that the third derivative of $f(x) = a \tanh(bx)$ should be zero at $x = \pm 1$. The third derivative is: $$\frac{\partial^{3} f}{\partial x^3}=\frac{-2ab^3}{\cosh^2(bx)}\left(\frac{1}{\cosh^2(bx)} - 2\tanh^2(bx)\right) $$
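For reference, the intermediate derivatives leading to this expression can be written out as follows (a worked sketch using only the chain rule and $\frac{d}{du}\tanh(u) = \frac{1}{\cosh^2(u)}$):

\begin{align}
f'(x) &= \frac{ab}{\cosh^2(bx)} \\
f''(x) &= \frac{-2ab^2 \tanh(bx)}{\cosh^2(bx)} \\
f'''(x) &= \frac{-2ab^3}{\cosh^2(bx)}\left(\frac{1}{\cosh^2(bx)} - 2\tanh^2(bx)\right)
\end{align}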

So we set it to zero: $$1-2\sinh^2(bx)=0, \quad x = \pm 1$$ $$bx=\pm\operatorname{arcsinh}\left(\frac{1}{\sqrt{2}}\right)$$ Plugging the values into numpy I get:

$$b=0.6584789484624083$$ Plugging the result into $f(1)$ I get: $$a=1.7320509044937022$$
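The numpy code itself is not shown here. A minimal standard-library sketch of the same computation might look like the following; the names `a`, `b`, and `f` are illustrative, and `math` stands in for numpy.

```python
import math

# Condition 1: f'''(1) = 0  <=>  sinh(b)^2 = 1/2 (taking the positive root).
b = math.asinh(1 / math.sqrt(2))

# Condition 2: f(1) = a * tanh(b) = 1.
a = 1 / math.tanh(b)


def f(x):
    """Scaled tanh using the two constants computed above (illustrative name)."""
    return a * math.tanh(b * x)


print(f"b = {b}")
print(f"a = {a}")
print(f"f(1) = {f(1.0)}, f(-1) = {f(-1.0)}")
```

The exact digits printed depend on the floating-point precision used, so small differences in the last decimals are to be expected.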

This means there is a slight difference between my values and his. Comparing the two tanh functions with our respective constants, I get $\delta = 0.0012567267661376946$ at $x=1$ using my numpy code.

Either I made a mistake, he did not have such an accurate numerical solver/lookup table, or he chose a "nicer looking" number.
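One way to probe these possibilities is to locate numerically where $|f''|$ peaks for each pair of constants. The sketch below is an assumption-laden illustration, not the original code: it uses the closed-form second derivative of $a \tanh(bx)$ and a plain grid search, and the helper names are hypothetical.

```python
import math


def second_derivative(x, a, b):
    """Closed-form f''(x) for f(x) = a * tanh(b * x)."""
    return -2 * a * b**2 * math.tanh(b * x) / math.cosh(b * x) ** 2


def peak_of_abs_second_derivative(a, b, x_max=3.0, steps=30000):
    """Grid-search the x in [0, x_max] that maximizes |f''(x)|."""
    xs = [x_max * i / steps for i in range(steps + 1)]
    return max(xs, key=lambda x: abs(second_derivative(x, a, b)))


# Published constants from the question.
x_published = peak_of_abs_second_derivative(1.7159, 2 / 3)

# Constants obtained from the two conditions f(1) = 1 and f'''(1) = 0.
b_derived = math.asinh(1 / math.sqrt(2))
a_derived = 1 / math.tanh(b_derived)
x_derived = peak_of_abs_second_derivative(a_derived, b_derived)

print(f"peak with published constants: x = {x_published}")
print(f"peak with derived constants:   x = {x_derived}")
```

With the published constants the peak sits close to, but slightly below, $x = 1$ (analytically at $\operatorname{arcsinh}(1/\sqrt{2}) / (2/3)$), while the derived constants place it at $x = 1$ up to the grid resolution.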

a-doering

I once set out to derive a symbolic solution (without trigonometric functions) for myself, mostly relying on Wolfram Alpha to do the heavy lifting, using the same constraints as @a-doering ($f(\pm1)=\pm1$ and $f'''(\pm1)=0$). I arrived at these fairly nice-looking (all things considered) coefficients:

\begin{align} f(x) &= a \tanh(bx) \\ a &= \sqrt{3} &&\approx 1.732050808 &&\approx 1.7159 \\ b &= \frac{-\ln(2 - \sqrt{3})}{2} &&\approx 0.658478948 &&\approx \frac{2}{3} \end{align}

Unfortunately, I do not remember the steps I took to get there.

Here's an interactive graph for anyone interested in playing around with it: https://www.desmos.com/calculator/tf4udjl8cn Original in red, mine in green. I doubt there would be any real difference in training outcome between them.

Addendum

I went back and rederived it, so here goes.

Start by deriving $b$: set $x=1$ and solve $\left.\frac{d^3}{dx^3} \tanh(bx)\right|_{x=1} = 0$ (the factor $a$ has no impact on where the third derivative vanishes). WolframAlpha gives $b = \pm\frac{1}{2}\cosh^{-1}(2) = \pm\frac{1}{2}\ln(2+\sqrt{3})$.
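The value reported by WolframAlpha can also be reached by hand. A short sketch, starting from the third derivative of $\tanh(bx)$ at $x=1$ and using the identities $\cosh(2b) = 1 + 2\sinh^2(b)$ and $\cosh^{-1}(y) = \ln\left(y + \sqrt{y^2 - 1}\right)$:

\begin{align}
\frac{1}{\cosh^2(b)} - 2\tanh^2(b) &= 0 \\
1 - 2\sinh^2(b) &= 0 \\
\cosh(2b) = 1 + 2\sinh^2(b) &= 2 \\
b &= \pm\frac{1}{2}\cosh^{-1}(2) = \pm\frac{1}{2}\ln(2+\sqrt{3})
\end{align}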

Since setting $b=\frac{1}{2}\ln(2+\sqrt{3})$ guarantees that the third-derivative condition is met at $x=1$, we can fix both $x$ and $b$, and scale the output to fulfill the $f(1) = 1$ condition. Trusting in WolframAlpha, we have:

\begin{align} a \tanh\left(\frac{\ln(2+\sqrt{3})}{2}\right) &= 1 \\ a &= \frac{1}{\tanh\left(\frac{\ln(2+\sqrt{3})}{2}\right)} \\ &= \sqrt{3} \end{align}
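The value $\sqrt{3}$ can be checked by hand as well. Writing $u = \frac{1}{2}\ln(2+\sqrt{3})$ as shorthand (a symbol introduced only for this sketch), so that $e^{2u} = 2+\sqrt{3}$:

\begin{align}
\tanh(u) &= \frac{e^{2u} - 1}{e^{2u} + 1} \\
&= \frac{1+\sqrt{3}}{3+\sqrt{3}} \\
&= \frac{1+\sqrt{3}}{\sqrt{3}\,(\sqrt{3}+1)} \\
&= \frac{1}{\sqrt{3}}
\end{align}

so $a = 1/\tanh(u) = \sqrt{3}$.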

To see the equivalence with my original answer, note that $\ln(2+\sqrt{3}) = -\ln(2-\sqrt{3})$. The full function thus becomes: \begin{align} f(x) &= \sqrt{3} \tanh\left( \frac{\ln(2 + \sqrt{3})}{2} x \right) \end{align}
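As a sanity check, the closed-form constants can be tested numerically against both constraints. This is a minimal sketch using only the standard library; `A`, `B`, `f`, and `f_third` are illustrative names, and `f_third` assumes the standard closed-form third derivative of $a \tanh(bx)$.

```python
import math

A = math.sqrt(3)
B = 0.5 * math.log(2 + math.sqrt(3))


def f(x):
    """f(x) = sqrt(3) * tanh(ln(2 + sqrt(3)) / 2 * x)."""
    return A * math.tanh(B * x)


def f_third(x):
    """Closed-form third derivative of A * tanh(B * x)."""
    sech2 = 1 / math.cosh(B * x) ** 2
    return -2 * A * B**3 * sech2 * (sech2 - 2 * math.tanh(B * x) ** 2)


# Constraint 1: f(+-1) = +-1
print(f(1.0), f(-1.0))

# Constraint 2: f'''(+-1) = 0 (up to floating-point rounding)
print(f_third(1.0), f_third(-1.0))

# Equivalence of the two ways of writing b: ln(2 + sqrt(3)) = -ln(2 - sqrt(3))
print(math.log(2 + math.sqrt(3)), -math.log(2 - math.sqrt(3)))
```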

masaers