3

I am a beginner studying scientific computation, more specifically floating-point numbers and precision in MATLAB. When testing the outputs of the two functions below, I am not sure how MATLAB computed the results or why they differ.

    y = @(x) (exp(x)-1-x)/x^2; 
    z = @(x) (exp(x)-x-1)/x^2; 
    x = 1e-10; 
    fprintf('x=%.5e\n  y=%.5e\n  z=%.5e\n', x, y(x), z(x))

Result:
x=1.00000e-10
y=8.27404e+02
z=0.00000e+00

As you can see, the only difference between y and z is the order of the terms in the numerator: $\exp(x)-1-x$ versus $\exp(x)-x-1$. Is the fact that floating-point subtraction is not associative the reason the results of y and z differ?
Why is z's result exactly 0 instead of some imprecise number? Is it because MATLAB's double-precision arithmetic rounded it to 0?

Any clarification would be appreciated.

cronk
    The sum of floating-point numbers is not associative. For just two additions the error would be small, but the division by $x^2$ magnifies it. – plop Oct 02 '22 at 16:33
  • @user85667 +1, yes, because of roundoff and truncation errors – gt6989b Oct 02 '22 at 16:33
  • Other issues with floating point numbers in Matlab: https://fr.mathworks.com/matlabcentral/answers/69-why-does-1-2-3-1-3-not-equal-zero – Jean Marie Oct 02 '22 at 16:56

2 Answers

5

First let's see what the answer should have been if all calculations were performed on exact real numbers (that is, with infinite precision and no roundoff):

\begin{align} \exp(x) &= 1 + \frac{x}{1!} + \frac{x^2}{2!} + \frac{x^3}{3!} + \frac{x^4}{4!} + \cdots. \\ \exp(x) - 1 - x &= \frac{x^2}{2!} + \frac{x^3}{3!} + \frac{x^4}{4!} + \cdots. \\ \frac{\exp(x)-1-x}{x^2} &= \frac{1}{2!} + \frac{x}{3!} + \frac{x^2}{4!} + \cdots \\ &= \frac12 + \frac16x + \frac1{24}x^2 + \cdots. \end{align}

So for very small $x$ we would get a result close to $\frac12,$ unlike either of the calculations MATLAB performed for you.
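As a quick sanity check on the series, one can sum a few Taylor terms directly, which involves no cancellation. This is a sketch in plain Python (not part of the original answer), using nothing beyond the standard library:

```python
import math

# Partial sum of the series (exp(x) - 1 - x)/x^2 = 1/2! + x/3! + x^2/4! + ...
# Each term is computed directly, so no catastrophic cancellation occurs.
def y_series(x, terms=10):
    return sum(x**k / math.factorial(k + 2) for k in range(terms))

print(y_series(1e-10))  # very close to 1/2
```

For $x = 10^{-10}$ this returns $\tfrac12 + \tfrac{x}{6} + \cdots \approx 0.5000000000167$, the value both MATLAB formulas should have produced.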

Working through what is apparently happening, we start with $\exp(x),$ which is very close to $1 + x$ for small $x$. Given the limited precision of a double-precision number, for small enough $x$ the remaining terms of the series ($\frac12 x^2$ and smaller terms) are unlikely to make any difference in what number finally is stored. But what is more important, the double-precision number is not even capable of representing $1 + x$ very well.

The number $x = 10^{-10}$ is not exactly representable in double precision, and the computer ends up using all the bits of precision it has in order to represent it as closely as possible. But when we add $x$ to $1,$ we have to shift the mantissa of $x$ to the right in order to add it to the mantissa of $1,$ and when we put the result back into a double-precision number we lose several bits of precision.
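That bit loss is easy to observe directly. Python uses the same IEEE-754 doubles as MATLAB, so the following sketch shows that adding $10^{-10}$ to $1$ and subtracting $1$ again does not recover $x$:

```python
x = 1e-10
# Adding x to 1 shifts x's mantissa right by over 30 bits to line it up
# with the mantissa of 1; subtracting 1 again recovers only the bits
# that survived the shift.
print((1.0 + x) - 1.0)  # roughly 1.0000000827e-10, not 1e-10
```

The relative error of the recovered value is about $10^{-7}$, which is exactly the loss of precision described above.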

Likewise when we put $\exp(x)$ into a double-precision number, where $x = 10^{-10}$, we lose much of the precision of the $x$ term of the series and essentially all of the precision of the remaining terms. The result is $1 + x + \delta = 1.0000000001 + \delta,$ where the roundoff error $\delta$ is something on the order of $\pm 10^{-16}.$


So what happens when we try to compute $y$?

We subtract $1,$ and end up with $10^{-10} + \delta,$ where $\delta$ is still on the order of $\pm 10^{-16}.$ That is, instead of the usual $16$ to $17$ significant decimal digits we enjoy in double-precision floating point numbers, we may have as few as $7$ digits of precision.

Now we subtract $x$, which is $10^{-10}$ but is represented only approximately in double precision. Still, the error of the representation of $x$ is at most something on the order of $10^{-26},$ which is very small compared to the error we already have in the representation of $\exp(x) - 1.$ So after subtracting $x$ we have $\delta'$, where $\delta'$ is either $\delta$ or something very close to $\delta,$ possibly something on the order of $\pm 10^{-16}.$

Finally we divide by $x^2,$ that is, we multiply by $10^{20}$ (approximately, since $x$ was not represented exactly in the computer). We end up with something possibly on the order of $\pm 10^{-16} \times 10^{20} = \pm 10000.$ And (no surprise) what we actually get is $827.404.$ This is actually more "accurate" than we have any right to expect; if you do the same calculation in Python but set $x$ to 1.01e-10 instead of 1e-10, the result is $9525.989554036925$.
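Since Python and MATLAB share the same double arithmetic, the whole computation of $y$ can be reproduced there. As a hedged aside not in the original answer, the sketch also shows `math.expm1`, the standard-library remedy that computes $\exp(x)-1$ without the cancellation:

```python
import math

x = 1e-10
y = (math.exp(x) - 1 - x) / x**2
print(y)  # around 827.4 on a typical libm, nowhere near the true value 0.5

# expm1 evaluates exp(x) - 1 directly, so the small quantity is never
# formed by subtracting two numbers close to 1:
y_good = (math.expm1(x) - x) / x**2
print(y_good)  # close to 0.5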


Now let's try to compute $z$.

We subtract $x$ from $\exp(x)$ and end up with $1 + \delta'',$ where $\delta''$ represents whatever bits of $x$ got lost in the roundoff and is on the order of $\pm 10^{-16}$ or less. Depending on what those bits are, $\delta''$ could be $\epsilon$ (the "machine epsilon" of a double-precision number), $-\epsilon,$ or exactly $0.$

In this case it appears that $\delta'' = 0.$ So the result of exp(x) - x is exactly $1.$ Now we subtract $1$ and we have $0$ exactly.

Divide $0$ by a positive number and you still have $0,$ which is the final value of $z.$

For different values of $x$ you can find that $\exp(x) - x$ does not round off to $1$ exactly, but rather to $1 \pm \epsilon.$ And then you get a value of $z$ that is nonzero, but no more accurate than the value of $y$ you got. In fact, for x = 1.04e-10 in Python I find that z comes out to $-10264.635952525487.$
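This sensitivity is easy to explore; here is a sketch in Python (the exact nonzero values depend on the platform's `exp` implementation, so treat the printed numbers as illustrative):

```python
import math

def z(x):
    # Same formula as the MATLAB anonymous function z
    return (math.exp(x) - x - 1) / x**2

# Tiny changes in x flip z between exactly 0 and values of order 1e4,
# depending on which way the roundoff of exp(x) - x happens to fall:
for x in (1e-10, 1.01e-10, 1.04e-10):
    print(x, z(x))
```

None of these outputs is anywhere near the true value $\approx \tfrac12$; they only differ in which rounding artifact survives.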

David K
  • During the calculation of y, you mentioned that "The number x=10^−10 is not exactly representable in double precision" and "when we add x to 1, we have to shift the mantissa of x to the right in order to add it to the mantissa of 1, and when we put the result back into a double-precision number we lose several bits of precision." I realized that when I set x=2^(-26), y and z both come out as 0. I understand why z does this, but how come y as well? Is it because 2^(-26) can be represented exactly in double precision, resulting in a more accurate y? – cronk Oct 03 '22 at 02:29
  • Apparently $\exp(2^{-26})$ is represented by $1+2^{-26}$ exactly. Likewise $2^{-26}$ is an exact value in double precision, so you can subtract $1$ and $2^{-26}$ in either order without loss of precision. However, the answer is not accurate, because $\exp(2^{-26})$ is really a tiny bit larger than $1+2^{-26}+2^{-53}$ and the answer still should be approximately $1/2,$ not zero. – David K Oct 03 '22 at 02:51
  • I understand that the actual result is approximately 1/2, not zero. However, if you set x = 2^-26 and y = @(x) (exp(x)-1-x)/x^2 in MATLAB, the result is displayed as 0. So, could we consider this as a MATLAB error? – cronk Oct 03 '22 at 14:48
  • I think this is generally considered a programming error, that is, you're asking a piece of software to do calculations it is not intended to do. You could look at it as an error caused by the limited number of bits in the double-precision IEEE floating-point format used by the hardware, but I think the point of studying floating-point numbers and precision in scientific computation is to learn the inherent limitations of computer floating-point representations so that you can avoid getting nonsense results from them. Either way, it's not the fault of MATLAB. – David K Oct 03 '22 at 15:10
  • I see, that makes much more sense. Thank you so much. – cronk Oct 03 '22 at 15:19
3

$x=10^{-10}$ has a "filled" mantissa in the binary floating-point encoding: since $10^{-10} = 2^{-10}\cdot 5^{-10}$ and $5^{-10}$ has an infinite binary expansion, all $52$ mantissa bits are used to approximate it.

As $x^2=10^{-20}$ is zero relative to $1$ in floating-point precision, $\exp(x)$ evaluates to $1+x$ rounded to floating-point precision (or possibly $1$ ulp, a unit in the last place, larger).

$\exp(x)-1$ thus returns a truncated version of $x$, essentially its leading $5$ or $6$ digits. Consequently $(\exp(x)-1)-x$ returns the remainder of $x$ under this truncation.

In $(\exp(x)-x)-1$ the floating-point errors are undone in the reverse of the order in which they were introduced, so the result is zero (remember that $\exp(x)=1+x$ at this level of precision).
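The order-of-operations argument can be inspected directly in Python, which uses the same IEEE doubles (the exact printed values assume a typical, essentially correctly rounded `exp`):

```python
import math

x = 1e-10
print(math.exp(x) - 1)        # a truncated x: only its leading ~6 digits survive
print((math.exp(x) - 1) - x)  # the remainder of x under that truncation
print((math.exp(x) - x) - 1)  # errors undone in reverse order: exactly 0.0
```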

Lutz Lehmann