template<class T, class U>
T union_cast(U data) {
    union { U a; T b; } t{data};
    return t.b;
}

float128_t quick_invsqrt_with_magic_num(int128_t mnum, float128_t X) {
    auto x = union_cast<float128_t>(mnum - (union_cast<int128_t>(X) >> 1));
    return x;
}

I'm trying to figure out a fast invsqrt hack for float128 to complete my library, and based on the R0 value given in Chris Lomont's paper, I get that the magic number should probably be 0x2FFF6EC85E7DE30DAABC6027118577EF.

int main() {
    cout << (190_128 >> 1);
    cout << int128_t(r0*(1_u128 << 23)+0.5);
    cout << endl;
    cout << (24574_128 >> 1);
    cout << int128_t(r0*(1_u128 << 112)+0.5);
    cout << endl; // theoretical optimum? (from "Fast Inverse Square Root" by Chris Lomont)
}

But in my test programme this theoretical magic number generates NaN on 14 of 36 test inputs, which never happens with invsqrt for float64. I tried generating the magic number from random candidates and repeatedly checking their quality; the programme ran for a whole day on my PC, and the best result I got was 13 NaNs out of 36 tests.

2FFF6EC85E7DE30DAABC6027118577EF : 2.53097e+237
not good enough on 2 13 7.2e+73 1.14514e+196 980 0.142857 0.0588235 0.0138889 0.0136986 0.000499251 0.000497265 0.000495786 0.00049334 0.000502765
bad rate 14/36 (38.8889%)

Is it theoretically impossible to derive fast invsqrt magic numbers for float128 and larger IEEE floating point numbers? If so, why? If not, is there any more accurate way to get fast invsqrt magic numbers for IEEE floats with arbitrary parameters?

A link to an online run of the code I used for testing here.

  • Your code has typedef long double float128_t, which is incorrect on most platforms: on x86_64 it may be the 80-bit x87 format (perhaps packed in 96 bits or 128 bits), or an alias for double; on other platforms it's usually an alias for double and rarely (e.g. aarch64 linux) an alias for quadruple precision. – Claude Oct 14 '23 at 12:04
  • Considering the inherent inaccuracy of the result, I wonder what is the purpose of this for high-precision numbers. Can you normalize the number to $[1,4]$ by shifting by an even number of bits (or to whatever interval the algorithm performs on the best), convert it to float32, use the well-known algorithm, then renormalize it? – Amateur_Algebraist Oct 14 '23 at 12:43
  • @Amateur_Algebraist I think the added complexity of that approach would lose in terms of speed, if you just wanted to get the 2% rel.err. approximation (one bit shift and one subtraction, of 128bit numbers, plus reinterpretation of bits); but for finding an accurate rsqrt(), maybe it would be faster to adopt a mixed precision approach (e.g. use approximation to get 5 bits of accuracy in a half, then do newton steps expanding precision via single and double to get to 40 bits, finally 2 steps in float128 to get to full accuracy, instead of 5 steps in float128; range reduction is tricky tho). – Claude Oct 14 '23 at 12:59
  • @Amateur_Algebraist for an rsqrt() accurate to about the 113bits limit of float128 (but not correctly rounded) I benchmarked a mixed precision version vs simple implementation; mixed precision is about 2.3x* faster (with 2.8x longer source code) than naively using float128 throughout (*on one system with AMD 2700X CPU). The range reduction also ensures correct behaviour for denormalized numbers. – Claude Oct 14 '23 at 15:59
  • and using the obvious 1 / sqrtf128(x) is 3.1x faster than my mixed precision version: glibc's math library obviously has had more work put into it than my attempt. – Claude Oct 14 '23 at 16:20

1 Answer


The paper's method can be extrapolated to give a way to calculate the magic numbers for any floating point type, but finding $r_0$ (which doesn't depend on floating point type) is complicated, so I just used the given constant.
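For reference, the construction boils down to a closed form: magic = ((3·2^(E−2) − 2) << m) + round(r0·2^m), where E is the exponent-bit count and m the explicit-mantissa-bit count. A quick C++ sketch (the function name is mine); long double only carries enough digits of r0 for the narrower formats, which is why exact rational arithmetic is used in the Haskell below:

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Closed form: magic = ((3*2^(E-2) - 2) << m) + round(r0 * 2^m).
// long double precision of r0 only safely covers the narrower formats;
// wider ones (double, quadruple) need exact rational arithmetic.
std::uint64_t magic_closed_form(int exponentBits, int mantissaBits) {
    const long double r0 = 0.432744889959443195468521587014L;
    std::uint64_t expPart = (3ull << (exponentBits - 2)) - 2;
    std::uint64_t mantPart =
        (std::uint64_t)std::llround(std::ldexp(r0, mantissaBits));
    return (expPart << mantissaBits) + mantPart;
}
```

magic_closed_form(5, 10) gives 0x59BB and magic_closed_form(8, 23) gives 0x5F37642F, matching the half and single constants listed further down.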

This snippet of Haskell code calculates the magic number for a particular floating point type given the value of $r_0$ from the paper:

import Data.Ratio ((%))
import Numeric (showHex) -- used later

r0 :: Rational
r0 = 0.432744889959443195468521587014

magic :: Integer -> Integer -> Rational
magic exponentBits explicitMantissaBits =
  decode 2 explicitMantissaBits
    ( 0
    , 3 * 2 ^ (exponentBits - 2) - 2 - (2 ^ (exponentBits - 1) - 1)
    , floor (2 ^ explicitMantissaBits * (1 + r0) + 0.5)
    )

decode :: Integer -> Integer -> (Integer, Integer, Integer) -> Rational
decode base mantissaDigits (s, e, m) =
  (if s == 0 then id else negate) (m % 2 ^ mantissaDigits) * 2 ^^ e
  -- here (num % den) makes a rational number

Then you need to encode it into a binary floating point format; see Appendix 1 below for inefficient (but hopefully correct) code. Printing the integer in hex then gives these magic numbers for half/single/double/quadruple, the last of which is not the same as yours (how did you calculate it?):

59bb
5f37642f
5fe6ec85e7de30db
5ffe6ec85e7de30dab0ff77452ab6769

Appendix 1 (binary format conversion):

encode :: Integer -> Integer -> Rational -> (Integer, Integer, Integer)
encode _ _ 0 = (0, 0, 0)
encode base mantissaDigits n
  | n < 0 = case encode base mantissaDigits (negate n) of (s, e, m) -> (1 - s, e, m)
  | n < 1 = case encode base mantissaDigits (n * fromInteger base) of (s, e, m) -> (s, e - 1, m)
  | n >= fromInteger base = case encode base mantissaDigits (n / fromInteger base) of (s, e, m) -> (s, e + 1, m)
  | otherwise = (0, 0, round (n * fromInteger (base ^ (mantissaDigits - 1))))

binary :: Integer -> Integer -> Integer
       -> (Rational -> Integer, Integer -> Maybe Rational, Integer)
binary signBits exponentBits explicitMantissaBits =
    (enc, dec, enc (magic exponentBits explicitMantissaBits))
  where
    dec :: Integer -> Maybe Rational
    dec n
      | biasedExponent == 2 ^ exponentBits - 1 = Nothing
      | biasedExponent == 0 = Just 0 -- todo subnormal
      | otherwise =
          Just (decode 2 explicitMantissaBits (sign, unbiasedExponent, mantissa))
      where
        sign
          | n >= 2 ^ (exponentBits + explicitMantissaBits) = 1
          | otherwise = 0
        unbiasedExponent = biasedExponent - (2 ^ (exponentBits - 1) - 1)
        mantissa = implicitMantissa + 2 ^ explicitMantissaBits
        implicitMantissa = n `mod` 2 ^ explicitMantissaBits
        biasedExponent = (n `div` 2 ^ explicitMantissaBits) `mod` 2 ^ exponentBits
    enc :: Rational -> Integer
    enc n = case encode 2 (explicitMantissaBits + 1) n of
      (s, unbiasedExponent, explicitMantissa)
        | n == 0 -> 0
        | biasedExponent <= 0 -> 0 -- underflow, todo subnormal
        | biasedExponent >= 2 ^ exponentBits - 1 ->
            (sign * 2 ^ exponentBits + 2 ^ exponentBits - 1) * 2 ^ explicitMantissaBits -- overflow, infinity
        | otherwise ->
            (sign * 2 ^ exponentBits + biasedExponent) * 2 ^ explicitMantissaBits + implicitMantissa
        where
          sign
            | signBits > 0 = s
            | otherwise = 0
          mantissaBits = explicitMantissaBits + 1
          biasedExponent = unbiasedExponent + (2 ^ (exponentBits - 1) - 1)
          implicitMantissa = explicitMantissa - 2 ^ (mantissaBits - 1)

half      = binary 1  5  10
single    = binary 1  8  23
double    = binary 1 11  52
quadruple = binary 1 15 112

test float@(enc, dec, mgc) n = case dec (enc (toRational n)) of
  Just m -> do
    putStrLn (showHex mgc "") -- magic number
    print (n - fromRational m) -- should be small
    let r = rSqrtLinear float (toRational n)
        l = toRational n * r * r
    print (fromRational l) -- should be 1

-- here pi defaults to Double precision floating point
main = do
  test half pi
  test single pi
  test double pi
  test quadruple pi

rSqrtLinear (enc, dec, mgc) x =
  case dec (mgc - (enc x `div` 2)) of Just y -> y

Appendix 2 (test code in C):

#define _GNU_SOURCE
#include <math.h>
#include <stdio.h>

typedef _Float128 float128_t;
typedef __int128 int128_t;

int128_t decode(float128_t f) { return *(int128_t *)&f; }

float128_t encode(int128_t i) { return *(float128_t *)&i; }

float128_t magic_f128 = 7.813819087707539518680007964123963e2465f128;
int128_t magic_i128;

float128_t rsqrt_approx(float128_t x) {
    return encode(magic_i128 - (decode(x) >> 1));
}

float128_t relative_error(float128_t x) {
    float128_t y = 1 / sqrtf128(x);
    return fabsf128(rsqrt_approx(x) - y) / y;
}

float128_t test_cases[] = {
    2, 7, 17, 13, 72, 73, 2003, 2011, 2017, 2027, 1989,
    72893, 72901, 72907, 72911, 72923, 72931, 72937, 72949, 72953,
    9.99e+37, 72e72, 17e17, 114514e191, 980,
    1.0/3, 1.0/7, 1.0/17, 1.0/13, 1.0/72, 1.0/73, 1.0/2003, 1.0/2011, 1.0/2017, 1.0/2027, 1.0/1989,
    1.0/72893, 1.0/72901, 1.0/72907, 1.0/72911, 1.0/72923, 1.0/72931, 1.0/72937, 1.0/72949, 1.0/72953,
    1.0/9.99e+37, 1.0/72e72, 1.0/17e17, 1.0/114514e191, 1.0/980,
};

float128_t rms_error(void) {
    int count = sizeof(test_cases) / sizeof(test_cases[0]);
    float128_t sum = 0;
    for (int i = 0; i < count; ++i) {
        float128_t err = relative_error(test_cases[i]);
        sum += err * err;
    }
    return sqrtf128(sum / count);
}

int main(int argc, char **argv) {
    magic_i128 = decode(magic_f128);
    printf("%g\n", (double) rms_error());
    return 0;
}

This test code (compiled with gcc version 12.2.0 (Debian 12.2.0-14) x86_64) outputs 0.0198455, which shows the algorithm is working (relative error about 2%).

Claude
  • Idea for a small speedup that may be applicable to a software floating point implementation: I haven't checked, but it will probably work to take the leading 32 bits of the binary representation of the float128 (giving 1 sign bit, 15 exponent bits, 16 explicit mantissa bits) and use the approximation with magic number 0x5ffe6ec8 (the leading 32 bits of the float128 magic number). The trailing 96 bits of the mantissa can be left untouched, as they will be less significant than the error of the approximation. – Claude Oct 15 '23 at 11:14
  • That's a great idea and it's actually how the "original" 1986 version worked. Kahan and Ng split a double into 32 bit halves and used a 32 bit restoring constant. https://gist.github.com/Protonk/f3c5bb91f228ffec4d4c5e2eb16e489d – Adam Hyland Dec 14 '24 at 17:02