10

The NaCl ref implementation of Poly1305 performs modular multiplication to calculate a polynomial $\mod 2^{130} - 5$ using the following modular multiplication function:

static void mulmod(unsigned int h[17],const unsigned int r[17])
{  
  unsigned int hr[17];
  unsigned int i, j, u;

  for (i = 0;i < 17;++i) {
    u = 0;
    for (j = 0;j <= i;++j) u += h[j] * r[i - j];
    for (j = i + 1;j < 17;++j) u += 320 * h[j] * r[i + 17 - j];
    hr[i] = u;
  }
  for (i = 0;i < 17;++i) h[i] = hr[i];
  squeeze(h);
}

... where squeeze() is a reduction. A sequence of mulmod() operations is rounded off with a freeze() operation:

static const unsigned int minusp[17] = {
  5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 252
} ;

static void freeze(unsigned int h[17])
{
  unsigned int horig[17];
  unsigned int j, negative;

  for (j = 0;j < 17;++j) horig[j] = h[j];
  add(h,minusp);
  negative = -(h[16] >> 7);
  for (j = 0;j < 17;++j) h[j] ^= negative & (horig[j] ^ h[j]);
}

This implementation has no documentation, and has been copied literally to various projects (the code above is actually from the libsodium fork).

I've compared this to the various multiplication algorithms on Wikipedia, Montgomery Reduction and Daniel Bernsteins explicit CRT method but I don't recognise it as any of them.

The most basic implementation of the overall Poly1305 using gmp simply uses add/mul/mod operations, so there's nothing that esoteric in the main loop.

The reduction and the freeze seem to indicate this is some kind of residue number system but I can't find explicit descriptions of how that would work.

One of Daniel Bernsteins motivations when writing this was probably to have a constant time implementation, if that helps identify the algorithm.

Does anyone recognize the algorithm being used here, and know of/can provide a complete explanation of it (i.e. one that explains how the magic numbers 320 and minusp are derived, and the basis for all xor operations in the freeze)?

Squeamish Ossifrage
  • 49,816
  • 3
  • 122
  • 230
archie
  • 1,998
  • 17
  • 28

1 Answers1

9

Numbers get represent as in base 256, i.e. $h = \sum_{i=0}^{17} h_i \cdot 256^i$. Since ints are used which are significantly larger than bytes you don't need to propagate carries immediately.

If you forget about modular reduction, then the $i$th digit of the result is computed as $\sum_{j=0}^i h_j\cdot r_{i-j}$. Apart from the lack of carry this is pretty much the same as schoolbook multiplication.

Now to take modular reduction into account you take the high bits of the result starting with bit 130. Now you can shift these 130 bits to the right, multiply them by 5 and add it to the unreduced result. Now shifting to the right by 130 bits can be done by shifting to the right by 17 bytes and then shifting 6 bits to the left. Shifting 17 bytes to the right is done using index arithmetic in r[i + 17 - j]. Shifting 6 bits to the left is equivalent to multiplying with 64. Combining that with the multiplication with 5, gives you a multiplication by 320.

for (i = 0;i < 17;++i)
{// compute the i-th byte of the output 
    u = 0;
    for (j = 0;j <= i;++j) u += h[j] * r[i - j];// schoolbook multiplication
    for (j = i + 1;j < 17;++j) u += 320 * h[j] * r[i + 17 - j];// 5 * overflow >> 130
    hr[i] = u;
}

So the first inner loop is the contribution of those digits that didn't overflow, and the second inner loop the contribution of the overflow.

freeze seems to simply be:

if(h > p)
    return h - p;
else
    return h;

Except it uses bitwise operations to avoid the branch. minusp is $2^{8 \cdot 17}-(2^{130}-5)$, i.e. it simply represents $-p$. It can be used to reduce modulo $p$ if the input is less than $2p$.

Since the individual elements of h can be larger than 256 but are in base 256 you need to propagate carries at the end of each multiplication. This is done using squeeze. Now after propagating the carries you get a number between 0 and $2p-1$. Having a slightly too large value as the result of an intermediate step isn't a problem, so no further reduction happens by default.

But at the of the MAC you need a number between 0 and $p-1$, so freeze is used as final reduction.

CodesInChaos
  • 25,121
  • 2
  • 90
  • 129