
The addition and multiplication laws used by large mainstream libraries achieve higher speed by using many more operations in order to avoid large intermediate numbers. My problem is this: the kind of speedup achieved on a GPU/CPU would mean using needlessly more space on a chip; in other words, I want to avoid something like being 25% faster at the cost of using 30% more die area.

Space being scarce in my case, I need an addition or subtraction law that minimizes latency per slice, rather than achieving the minimum latency at the expense of so much space that it would slow down other operations on the chip (a pipelined design would be acceptable, as the elliptic curve unit would constantly have secp256k1 points that need $G$ added to or subtracted from them, whichever is faster).

fgrieu
user2284570

1 Answer


The question is for curve secp256k1. That's the elliptic curve of Weierstrass equation $y^2=x^3+a\cdot x+b$ in the finite field $\mathbb F_p$ with $p=2^{256}-\epsilon$, $\epsilon=2^{32}+977$, $a=0$, $b=7$.

That's called a Koblitz curve because $a=0$ and $p\bmod 3=1$, implying properties^A that do not hold for all Weierstrass curves. But it's not the original kind of Koblitz curve over a binary field $\mathbb F_{2^\ell}$ such as sect283k1, which allows efficient FPGA implementations because, in such fields, addition reduces to bitwise XOR, and multiplication is carryless.

The question asks for the most efficient point addition, without specifying a coordinate system. Projective coordinates, which represent $(x,y)$ by a triple $(X,Y,Z)$ with $x\cdot Z\equiv X\pmod p$ and $y\cdot Z\equiv Y\pmod p$, would seem attractive, because their point addition formulas are efficient. That's however not the case here, because based on the OP's earlier questions, the underlying goal is to find an integer $k$ such that the hash of the compressed representation of $[k]G$ has a certain start, with $k$ in a given interval $[k_0,k_1[$ with $k_1-k_0$ relatively small (say, about 68-bit) compared to secp256k1's 256-bit order $n$. We can move from one candidate $[k]G$ to the next $[k+1]G$ by a single point addition. But we need the Cartesian coordinates of each point that we explore. Conversion from projective coordinates to Cartesian requires a modular inversion, which is costly, and offsets the benefit of performing the point additions in projective coordinates.

Therefore I think that the best point addition method uses some variation of the standard point addition formulas in Cartesian coordinates (case 4 of points not sharing the same $x$ coordinate, since we can restrict to $1<k_0<k_1<n$ where $n$ is secp256k1's order), which obtain the coordinates $(x_3,y_3)$ of the sum of points of coordinates $(x_1,y_1)$ and $(x_2,y_2)$ as $$\begin{array}{llcll} a.&&e&:=&(x_2-x_1)^{-1}&\bmod p\\ b.&&\lambda&:=&(y_2-y_1)\cdot e&\bmod p\\ c.&&x_3&:=&\lambda^2-x_1-x_2&\bmod p\\ d.&&y_3&:=&\lambda\cdot(x_1-x_3)-y_1&\bmod p \end{array}$$ That uses 1 modular inversion (step a.), which dominates the cost, 2 modular multiplications (steps b. and d.), 1 modular squaring (step c.), and 6 comparatively inexpensive modular subtractions.
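As a sanity check of steps a.–d., here is a Python sketch with secp256k1's parameters (not FPGA code; `point_add` and `point_double` are hypothetical names, and Python's built-in `pow(·, -1, p)` stands in for whatever modular inverter the hardware uses):

```python
p = 2**256 - 2**32 - 977  # secp256k1's field prime

# secp256k1's base point G, handy for trying the formulas
Gx = 0x79BE667EF9DCBBAC55A06295CE870B07029BFCDB2DCE28D959F2815B16F81798
Gy = 0x483ADA7726A3C4655DA4FBFC0E1108A8FD17B448A68554199C47D08FFB10D4B8

def point_add(x1, y1, x2, y2):
    """Add two distinct curve points with x1 != x2 (case 4)."""
    e = pow(x2 - x1, -1, p)           # step a.: the dominant modular inversion
    lam = ((y2 - y1) * e) % p         # step b.
    x3 = (lam * lam - x1 - x2) % p    # step c.
    y3 = (lam * (x1 - x3) - y1) % p   # step d.
    return x3, y3

def point_double(x1, y1):
    """Doubling, included only to build a second point for testing (a = 0)."""
    lam = (3 * x1 * x1 * pow(2 * y1, -1, p)) % p
    x3 = (lam * lam - 2 * x1) % p
    y3 = (lam * (x1 - x3) - y1) % p
    return x3, y3
```

Computing $[2]G$ then $[3]G=[2]G+G$ and checking that the results satisfy $y^2=x^3+7\pmod p$ is an easy way to validate such formulas.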

If we repeatedly add the point of fixed coordinates $(x_G,y_G)$, there's a slight benefit in maintaining $(x',y')$ with $x'=(x_G-x)\bmod p$ and $y'=(y_G-y)\bmod p$ instead of $(x,y)$, in the following algorithm:

  • With a standard PC
    • Compute the Cartesian coordinates $(x,y)$ of $[k_0]G$ (that's a point multiplication)
    • Compute $x':=(x_G-x)\bmod p$
  • Compute $y':=(y_G-y)\bmod p$
  • In an FPGA, repeat for counter $c$ from $0$ onward
    1. Compute $x:=(x_G-x')\bmod p$
    2. Compute $y:=(y_G-y')\bmod p$
    3. Hash the bytestring representing $(x,y)$ in compressed representation; if it has the required property, stop the FPGA computation with result the current $c$.
    4. Compute $e':=x'^{-1}\bmod p$
    5. Compute $\lambda:=(y'\cdot e')\bmod p$
    6. Compute $x':=((3\,x_G\bmod p)-x'-\lambda^2)\bmod p$
    7. Compute $y':=((2\,y_G\bmod p)-\lambda\cdot x')\bmod p$
  • With a standard PC
    • compute $k:=k_0+c$.
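The FPGA loop above can be sketched in Python to check the recurrence: the state $(x',y')=(x_G-x,\,y_G-y)$ for the current point $[k]G$ is updated by steps 4–7, and $(x,y)$ is recovered as in steps 1–2. Hashing (step 3) is omitted here; the loop just collects the recovered points (`iterate` is a hypothetical name):

```python
p = 2**256 - 2**32 - 977
xG = 0x79BE667EF9DCBBAC55A06295CE870B07029BFCDB2DCE28D959F2815B16F81798
yG = 0x483ADA7726A3C4655DA4FBFC0E1108A8FD17B448A68554199C47D08FFB10D4B8
THREE_XG = (3 * xG) % p  # precomputed, as suggested in the text
TWO_YG = (2 * yG) % p

def iterate(xp, yp, rounds):
    """Run the steps-1..7 loop, returning the recovered points."""
    points = []
    for _ in range(rounds):
        x = (xG - xp) % p                       # step 1
        y = (yG - yp) % p                       # step 2 (full y; hash omitted)
        points.append((x, y))
        e = pow(xp, -1, p)                      # step 4: dominant inversion
        lam = (yp * e) % p                      # step 5
        xp = (THREE_XG - xp - lam * lam) % p    # step 6
        yp = (TWO_YG - lam * xp) % p            # step 7 (uses the updated x')
    return points
```

Note that step 7 deliberately uses the freshly updated $x'$ from step 6; seeding the state from some $[k_0]G$ and comparing a few iterates against the plain case-4 addition formulas confirms the recurrence.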

With $3\,x_G\bmod p$ and $2\,y_G\bmod p$ precomputed, we save a modular subtraction compared to the standard equations, and in the modular subtraction of step 2 we only need to compute the low-order bit of $y$, because the other bits do not enter the hash^B. Also, steps 1/2/3 can now run in parallel with steps 4/5/6/7.

Leaving step 3 aside, most of the effort is in the modular inversion of step 4. Among the ways to perform this:

  • $e':=x'^{(p-2)}\bmod p$, computed with 255 modular squarings and 248 modular multiplications using the standard left-to-right method. An improvement of that using a fixed addition chain with a single extra 256-bit cached value can save the vast majority of the modular multiplications.
  • The classic Extended Binary GCD, which manages to replace all the steps with additions, comparisons, and shifts of at-most 256-bit quantities. This is better for an FPGA implementation aiming at best efficiency per LUT.
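The second option might be sketched as follows (a plain Python model of the classic extended binary GCD, assuming odd prime modulus; `binv` is a hypothetical name, and a hardware version would map the halvings to shifts and the conditional `(u + p) // 2` to an add-and-shift):

```python
def binv(x, p):
    """Modular inverse of x mod odd prime p via extended binary GCD:
    only shifts, additions/subtractions and comparisons on <= 256-bit values.
    Invariants: u*x == a (mod p) and v*x == b (mod p)."""
    a, b = x % p, p
    u, v = 1, 0
    while a != 0:
        while a % 2 == 0:                        # strip factors of 2 from a
            a //= 2
            u = u // 2 if u % 2 == 0 else (u + p) // 2
        while b % 2 == 0:                        # strip factors of 2 from b
            b //= 2
            v = v // 2 if v % 2 == 0 else (v + p) // 2
        if a >= b:                               # subtract smaller from larger
            a, u = a - b, (u - v) % p
        else:
            b, v = b - a, (v - u) % p
    return v % p                                 # b == gcd == 1, so v*x == 1
```

At termination $b=\gcd(x,p)=1$ and the invariant $v\cdot x\equiv b\pmod p$ gives the inverse.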

A speed up in modular reduction modulo $p$ is possible because $p$ is a power of two minus $\epsilon$ with $\log\epsilon\ll\log p$. That's the form of primes suggested by Richard Crandall (see this for the likely reason why $\epsilon>2^{32}$ rather than $\epsilon=189$ which would allow even more savings). In particular, if $z$ is a more than 256-bit quantity, $z_0$ the low-order 256 bits of $z$ and $z_1$ the other bits, so that $z=z_1\cdot2^{256}+z_0$ with $z_0\in[0,2^{256})$, then $z'=\epsilon\cdot z_1+z_0$ is a partial modular reduction of $z$ modulo $p$. And repeating this transformation will quickly yield an at-most 257-bit reduction, which is easy to reduce to $z\bmod p$. Further, it's not indispensable to perform full modular reductions modulo $p$ at steps 4/5/6/7 (we must at steps 1/2).
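The folding transformation described above might be modeled as follows (a Python sketch, with `partial_reduce` a hypothetical name; in hardware, the multiplication by $\epsilon$ is a small multiplier since $\epsilon$ fits 33 bits):

```python
p = 2**256 - 2**32 - 977
EPS = 2**32 + 977  # epsilon, with 2^256 == EPS (mod p)

def partial_reduce(z):
    """Fold the bits above 2^256 back down: z = z1*2^256 + z0 becomes
    EPS*z1 + z0, congruent to z mod p. Repeating quickly shrinks z."""
    while z >> 256:
        z0 = z & (2**256 - 1)       # low-order 256 bits
        z1 = z >> 256               # the remaining high bits
        z = EPS * z1 + z0
    while z >= p:                   # final conditional subtraction(s)
        z -= p
    return z
```

As the text notes, the final conditional subtractions can often be skipped between intermediate steps, keeping values only partially reduced.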

Instead of adding $G$, we can add $G'=[s]G$ for some non-zero integer $s$ of small magnitude, chosen such that computations are slightly easier, with the easy necessary adjustments made externally to the FPGA. That has essentially no cost if $|s|$ is much less than the number of computation units.

For lack of hands-on experience on modern FPGAs, I won't dive into the choice of how deeply pipelined the design should be, or the best ways to implement the significant amount of multiplications.


^A In particular there are integers $\alpha\in[2,p)$ with $\alpha^3\bmod p=1$. Therefore if $(x,y)$ is on the curve then so are the other two points $(\alpha\cdot x\bmod p,y)$ and $(\alpha^2\cdot x\bmod p,y)$. This efficient endomorphism allows some speedup in point multiplication. For secp256k1, $\alpha=2^{(p-1)/3}\bmod p$ is suitable.
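This footnote is easy to check numerically (a Python sketch with secp256k1's parameters; `apply_endomorphism` is a hypothetical name):

```python
p = 2**256 - 2**32 - 977

# A primitive cube root of unity mod p, per the footnote's claim that
# 2^((p-1)/3) mod p is suitable (i.e., that 2 is not a cube mod p).
alpha = pow(2, (p - 1) // 3, p)

def apply_endomorphism(x, y):
    """Map (x, y) to (alpha*x mod p, y); since alpha^3 == 1 (mod p), cubing
    kills alpha and the curve equation y^2 = x^3 + 7 is preserved."""
    return (alpha * x) % p, y
```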

^B Alternatively, we could skip step 2 and instead perform two hashes at step 3, one for each parity of $y$, performing a verification of the candidate $k$ externally. But performing step 2 is cheaper than the extra hash.

fgrieu