What is the most computationally efficient way of generating pseudo-random permutations?

Question

I have an application in which I need to create up to J randomly shuffled copies of an array of length N. I will then run millions or even billions of iterations; in each one, I have to fetch the values of K << N entries of the J permuted copies of the original array. The K entries that need to be accessed are the same for all J shuffled copies, but that set of K entries changes from one iteration to the next.

To give an approximate idea of the dimensions of the problem, assume J = 25000, N = 500000 and K = 50 even if those numbers are not fixed a priori.

The naive approach, which I've been following so far, is to use the Fisher-Yates algorithm (also known as the Knuth shuffle) to create the permuted versions of the array, storing them as an N×J matrix in memory. However, that matrix tends to be too large in many cases, which is problematic. Also, recomputing the entire matrix in each iteration would be painfully slow.

As an alternative, I have started to consider using pseudo-random permutations instead. Using a block cipher whose enciphering process has complexity O(1), I could ideally fetch the K entries I need with an extra complexity of O(KJ) per iteration, which appears to be acceptable empirically.

My question, as someone with an extremely rudimentary understanding of cryptography, is which scheme you would recommend for such a task. The key point is that I would rather have a scheme for which enciphering is really fast, even if the security is very poor by cryptographic standards. As long as the set of distinct permutations which can be generated (and look random enough) is on the order of 2^32, it's perfectly fine. AES-like security levels would be clear overkill.

As a reference, I am currently using a "handmade" 4-round Feistel network with ceil(log2(N)) bits and very simple round functions (16-bit round keys: 8 bits are XORed with one half of the plaintext and the result is passed through an S-box, the other 8 bits are used as an index into a set of 256 random bit shifts). I am also using a cycle-walking scheme to deal with the (common) cases for which N is not a power of 2.

I guess the scheme is a bad joke by cryptographic standards but, for my application (statistical hypothesis testing), it appears to be random enough. It would be great if I could reduce the complexity of the encryption scheme a bit.

I would be really thankful if any of you could provide some advice!

TL;DR I am looking for the fastest possible way to generate pseudorandom permutations of a set of size N. I care much more about the speed than the security of the scheme (about 2^32 "truly" random permutations is more than enough).

score 8 · Answer 1 · edited Oct 09 '20 at 11:29

Use format-preserving encryption. The current NIST standards-track mode FFX should be sufficiently fast for your purposes. For your domain size, you might also want to try the swap-or-not shuffle, a new construction that is also pretty fast and dead simple to implement. To get the absolute best speed from these schemes you should use a single AES call as your PRF, preferably with AES-NI instructions if you have them.

EDIT: Any PRP generator based on an oblivious card shuffle should work for your purposes. Here's another reference explaining what that is.

score 5 · Answer 2 · edited Apr 13 '17 at 12:48

I explain, criticize and try to improve the technique in the question (which asks for speed by using cryptographic techniques for arguably satisfactory functionality in a statistical simulation). Shuffling, and full-blown Format-Preserving Encryption aim at perfect or demonstrable cryptographic security, a different goal.

Under the assumption that the (unstated) distribution of the entries to compute is rather arbitrary, and storing $J\cdot N\cdot\lceil\log_2(N)\rceil$ bits is not a viable option, the general technique in the question seems best: build an efficient keyed pseudo-random permutation $P$ of the set $\{0\dots N-1\}$, with the key (noted $y$ to prevent a clash in notation) random, or obtained from the permutation index $j\in\{0\dots J-1\}$ (e.g. as $y=y_0\oplus j$ with $y_0$ a random constant fixed at the beginning of the many iterations); then evaluate $P_y(x)$ as needed.

The question builds $P$ from a block cipher $E$ of $n=\lceil\log_2(N)\rceil$-bit block size, with $P_y(x)$ computed by the cycle-walking technique:

repeat
- $x\gets E_y(x)$
until $x<N$
output $x$

Any $P_y$ demonstrably is a permutation of the set $\{0\dots N-1\}$. $P$ is demonstrably at least as secure as $E$ is (ignoring side-channel issues, e.g. by timing or power analysis). Computing $P_y(x)$ uses on average $2^n/N<2$ evaluations of $E$, and at most $2^n-N+1<N$.
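As a minimal sketch of the cycle-walking loop above, the C fragment below wraps a stand-in block cipher. Everything named here is an assumption made for illustration: `toy_feistel` is a hypothetical 4-round balanced Feistel cipher with an arbitrary round function (not the one described in the question), and `cycle_walk` plays the role of $P_y$. Because the toy cipher is balanced, its width $2h$ is even and may exceed $\lceil\log_2(N)\rceil$ by one bit, in which case the average number of evaluations of $E$ is bounded by 4 rather than 2.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for E_y: a toy 4-round balanced Feistel cipher on
   2*h bits. The round function is an arbitrary mixer chosen for this sketch. */
static uint32_t toy_feistel(uint32_t x, const uint32_t key[4], unsigned h)
{
    uint32_t mask = (1u << h) - 1;
    uint32_t left = x >> h, right = x & mask;
    for (int i = 0; i < 4; i++) {
        uint32_t f = ((right + key[i]) * 2654435761u >> 16) & mask;
        uint32_t t = left ^ f;      /* combine round output with the other half */
        left = right;               /* then swap the halves */
        right = t;
    }
    return left << h | right;
}

/* P_y(x): apply E_y repeatedly until the value falls back inside {0..N-1}.
   The loop terminates because E_y is a permutation and x starts below N. */
static uint32_t cycle_walk(uint32_t x, uint32_t N, const uint32_t key[4], unsigned h)
{
    assert(h >= 1 && h <= 15 && N <= (1u << 2 * h) && x < N);
    do {
        x = toy_feistel(x, key, h);
    } while (x >= N);
    return x;
}
```

For example, `cycle_walk(x, 1000, key, 5)` maps $\{0\dots999\}$ onto itself using a 10-bit cipher; roughly 2% of cipher outputs land in $\{1000\dots1023\}$ and trigger another pass through the loop.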

The question builds $E$ as a Feistel cipher, symmetric if $n$ is even, or mostly so if $n$ is not. It has 4 rounds of an ad-hoc round function, based on truncated AES S-boxes (see details in comments to the present question), without any gaping flaw that I could spot for $n\approx13$.

Even with a good design per the description given, there are potential issues: the Luby-Rackoff result (How to Construct Pseudo-random Permutations from Pseudo-random Functions, in proceedings of Crypto 1985) ensuring security for a Feistel cipher after 3 rounds (4 if we consider adaptive chosen ciphertext attacks) is only valid for independent and random round functions; asymptotically when $n$ grows; and when the number of distinct inputs evaluated is much less than $2^{n/2}$ (here, $\sqrt N$), something which is not warranted in the application. More rounds are needed as more space is explored (see Jacques Patarin's Luby-Rackoff: 7 Rounds Are Enough for $2^{n(1−\epsilon)}$ Security, in proceedings of Crypto 2003).

[Figure: 4-round Feistel cipher]

Expanded: I'll show why, with a small width of $E$, we need more than 4 rounds for something even approaching cryptographic soundness. In a 4-round symmetric Feistel cipher as illustrated above, if two keyings use identical $F_2$ within XOR of a constant at input, and identical $F_3$, then for any $I_a$, the function $I_b\to O_b$ is identical within XOR of a constant (dependent on $I_a$) at input. This characteristic is extremely improbable for two sizable random permutations (odds about $2^{n(2^{n/2-1}-2^{n-1})}$, that is $2^{-24192}$ for $n=12$), and, should it occur, it is easily detectable and conceivably could have an impact on a practical application. In the construction considered (detailed in comments to this question) for $n=12$, each round function is XOR with a 6-bit constant followed by one in $2^8$ S-boxes of 6x6 bits, thus any two random keyings have $F_2$ and $F_3$ causing that characteristic with odds $2^{-22}$, and this is expected to occur over a hundred times among $J=25000$ permutations constructed as proposed. For a practical attack, we choose an arbitrary fixed $I_a$, and partially map $I_b\to O_b$ for random $I_b$ (a little more than $2^{n/4}$ evaluations for each of the $J$ permutations will do); when we find a collision $E_b(I_a\|I_b)=E_b'(I_a\|I_b')$, we check if $\forall x,E_b(I_a\|(I_b\oplus x))=E_b'(I_a\|(I_b'\oplus x))$ (in the affirmative, we can then confirm that a similar property holds for any other $I_a$, and it is practically certain that $E$ and $E'$ share the same S-boxes in $F_2$, and in $F_3$, and the same XOR constant on entry of $F_3$). It is overwhelmingly likely that we will find a pair $(E,E')$ with such property when using the construction proposed in the question, and practically impossible for random permutations. The attack can be adapted to work for $P$ built using cycle-walking.

Corrected: Also, we can imagine applications where the parity of permutation $P$ could have an impact, such as the distribution of the number of swaps in sorting. Any Feistel cipher yields an even permutation, thus $E$ is even. While $N<2^n$ allows $P$ to be either even or odd, its parity will be strongly biased for many values of $N$ (argument: when $2^n-N$ is small compared to $\sqrt N$, for most $E$, the cycle-walking repeat loop is executed exactly twice for $2^n-N$ inputs, and once for all the others, so that the resulting parity of $P$ is $N\bmod 2$). So we need an ample amount of cycle-walking (this other question asks how much), or to make the permutation even or odd under control of a key bit (a simple technique uses XOR of any ciphertext that is all-zero except its rightmost bit, with a key bit controlling parity), or to deviate from straight Feistel (like using modular addition instead of XOR to combine the round function's result with the half state).
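The evenness of a Feistel permutation can be checked empirically on a small width. The sketch below is illustrative only: `toy_feistel` is a hypothetical 4-round balanced Feistel cipher with an arbitrary round function, and `perm_parity` computes parity from the cycle count (a permutation of `size` points with $c$ cycles is a product of `size` $-\,c$ transpositions). It assumes half-width $h\ge2$; with $h=1$ the half-swap alone is a single transposition, so the argument does not apply.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical toy 4-round balanced Feistel cipher on 2*h bits. */
static uint32_t toy_feistel(uint32_t x, const uint32_t key[4], unsigned h)
{
    uint32_t mask = (1u << h) - 1;
    uint32_t left = x >> h, right = x & mask;
    for (int i = 0; i < 4; i++) {
        uint32_t f = ((right + key[i]) * 2654435761u >> 16) & mask;
        uint32_t t = left ^ f;
        left = right;
        right = t;
    }
    return left << h | right;
}

/* Parity of a permutation table: 0 if even, 1 if odd. */
static int perm_parity(const uint32_t *perm, uint32_t size)
{
    unsigned char *seen = calloc(size, 1);
    assert(seen != NULL);
    uint32_t cycles = 0;
    for (uint32_t i = 0; i < size; i++) {
        if (seen[i]) continue;
        cycles++;                                   /* new cycle found */
        for (uint32_t j = i; !seen[j]; j = perm[j]) /* mark all its points */
            seen[j] = 1;
    }
    free(seen);
    return (int)((size - cycles) & 1);
}

/* Tabulate the toy cipher over all 2^(2h) inputs and return its parity. */
static int feistel_parity(const uint32_t key[4], unsigned h)
{
    assert(h >= 2 && h <= 10);
    uint32_t size = 1u << 2 * h;
    uint32_t *table = malloc(size * sizeof *table);
    assert(table != NULL);
    for (uint32_t x = 0; x < size; x++)
        table[x] = toy_feistel(x, key, h);
    int parity = perm_parity(table, size);
    free(table);
    return parity;
}
```

Whatever key is supplied, `feistel_parity` should return 0 for $h\ge2$, since each round is a product of an even number of transpositions. The same `perm_parity` helper can be pointed at a cycle-walked table to observe the bias described above.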

More generally: A recommendable way to inject entropy in a Feistel construct of small width is to use modular addition of a round key over the whole state width (rather than XOR over half the state as in the question's Feistel construct): that injects nearly twice as much entropy, which is very desirable; and balances the permutation's parity.

With careful specification of the Feistel cipher using somewhat more rounds, more key material injected, and dealing with the parity issue (except when clearly immaterial, which includes any case where at least 2 inputs are never used), the method has merit. The idea in another answer of using AES-NI instructions to build the round function would give something very fast, but at the expense of portability, and I will not venture into this.

Here is a revised attempt, christened fastperm2, at using cryptographic techniques to build efficient random permutations over a small domain. I stress that I give no assurance of cryptographic security; still, I challenge anyone to find an attack in the random key setup much better than $2^{64-i}$ steps for odds $2^{-i}$ of success, even restricted to a particular $N$.

Rationale:

easy to code in C;
limited amount of cycle-walking, for the extra effort is better spent on more rounds;
minimally simple round function, without S-boxes, using:
- diffusion (to the left, and to some limited degree to the right) by modular multiplication with a public multiplier;
- right diffusion and non-linearity by XOR with right-shifted state;
- combination with round keys using addition over the full state, to maximize key material and deal with permutation parity;
at least 16 rounds in the hope of compensating for that simplicity (I wish I could justify that value other than by analogy with serious ciphers);
enough rounds that at least 64 bits of entropy are injected in each half of the cipher discarding one round (for hoped 64-bit security against a generic meet-in-the-middle attack);
restrict domain so that $N!$ is comfortably above $2^{128}$, and state fits a 32-bit word.

Parameter selection according to $N$:

ensure $40\le N\le 2^{32}-2^{20}$ (for lower $N$, nothing beats Fisher-Yates anyway);
find the lowest $M$ with $N\le M$ such that $M\equiv2^k\pmod{2^{k+1}}$ with $k$ the integer closest to $n(\sqrt5-1)/2$ where $n=\lceil\log_2(M)\rceil$ (thus: $6\le n\le32$, $k=\lfloor(13n+10)/21\rfloor$, and $M=(2\big\lfloor\lceil N/{2^k}\rceil/2\big\rfloor+1)2^k$);
$r\gets2\max(\lceil64/(n-1)\rceil+1,8)$, the number of rounds;
$C\gets\lfloor(28657M+23184)/46368/2^k\rfloor2^k+(\text{7D8B5}_{16}\bmod 2^k)$ (so that $C\approx M(\sqrt5-1)/2$, and the low $k$ bits of $C$ are per the constant $F_{29}\equiv5\bmod 8$);
while $\gcd(C,M)\ne1$
- $C\gets C-2^k$
$s\gets n-k$, the shift count.

Note: parameters are such that $x\to C\cdot x\bmod M$ and $x\to x\oplus\lfloor x/2^s\rfloor$ are permutations of the set $\{0\dots M-1\}$. The first transformation achieves good left diffusion in the state bits; both transformations give some right diffusion (though limited to the leftmost $s$ bits for the first transformation).
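The parameter selection above can be sketched in C as follows. This is one reading of the procedure, not a reference implementation: the struct and function names are invented for the sketch, and because $n$ is defined in terms of $M$ while $k$ and $M$ depend on $n$, the code assumes it is acceptable to start from $n=\lceil\log_2(N)\rceil$ and increment $n$ when the resulting $M$ does not fit in $n$ bits (which can happen when $N$ lies just below a power of two).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the fastperm2 parameters. */
typedef struct {
    uint32_t N, M, C;
    unsigned n, k, s, r;
} fastperm2_params;

static uint64_t gcd64(uint64_t a, uint64_t b)
{
    while (b != 0) { uint64_t t = a % b; a = b; b = t; }
    return a;
}

static fastperm2_params fastperm2_setup(uint32_t N)
{
    assert(N >= 40 && N <= 0xFFF00000u);        /* 40 <= N <= 2^32 - 2^20 */
    unsigned n = 6, k;
    while (((uint64_t)1 << n) < N) n++;         /* first guess: ceil(log2(N)) */
    uint64_t M;
    for (;; n++) {
        assert(n <= 32);
        k = (13 * n + 10) / 21;
        uint64_t q = (N + ((uint64_t)1 << k) - 1) >> k;  /* ceil(N / 2^k) */
        M = (2 * (q / 2) + 1) << k;     /* lowest odd multiple of 2^k >= N */
        if (M < ((uint64_t)1 << n)) break;      /* n == ceil(log2(M)) holds */
    }
    unsigned t = (64 + n - 2) / (n - 1) + 1;    /* ceil(64/(n-1)) + 1 */
    uint64_t two_k = (uint64_t)1 << k;
    uint64_t C = ((28657 * M + 23184) / 46368 >> k << k)
               + (0x7D8B5 & (two_k - 1));
    while (gcd64(C, M) != 1)                    /* make C invertible mod M */
        C -= two_k;
    fastperm2_params p;
    p.N = N;
    p.M = (uint32_t)M;
    p.C = (uint32_t)C;
    p.n = n;
    p.k = k;
    p.s = n - k;
    p.r = 2 * (t > 8 ? t : 8);
    return p;
}
```

Under these assumptions, the question's example $N=500000$ gives $n=19$, $k=12$, $M=503808$, $r=16$, $C=313525$ and $s=7$.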

Sub-keys setup:

for each $j$ with $0\le j<r$
- set $y_j$ to a uniformly random value in $\{0\dots M-1\}$ (or using an unspecified pseudo-random function of $y$ and $j$)

Encryption of $x$ with $0\le x<N$:

repeat
- for each $j$ with $0\le j<r$ in ascending order
  - $x\gets(C(x\oplus\lfloor x/2^s\rfloor)+y_j)\bmod M$
while $x\ge N$
output $x$

It is critical that multiplication and modular reduction are implemented exactly. In C99, assuming all variables are of type uint32_t, a round is: x = ((x ^ x>>s)*(uint64_t)C + y[j]) % M;

Plan: give a reference C implementation.
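Pending that reference implementation, here is a minimal sketch of the sub-key setup and encryption steps described above. The struct, the function names and the sub-key generator are assumptions of this sketch: the answer leaves the pseudo-random function unspecified, so a splitmix64-style mixer reduced modulo $M$ (with a slight modulo bias) is used purely as a placeholder. The parameters are taken as given and must satisfy the constraints of the parameter selection.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the fastperm2 parameters. */
typedef struct {
    uint32_t N, M, C;
    unsigned n, k, s, r;
} fastperm2_params;

/* Fill y[0..r-1] with values in {0..M-1} derived from a 64-bit seed.
   Placeholder generator (splitmix64-style), not part of the answer. */
static void fastperm2_subkeys(const fastperm2_params *p, uint64_t seed, uint32_t *y)
{
    for (unsigned j = 0; j < p->r; j++) {
        uint64_t z = (seed += 0x9E3779B97F4A7C15ULL);
        z = (z ^ z >> 30) * 0xBF58476D1CE4E5B9ULL;
        z = (z ^ z >> 27) * 0x94D049BB133111EBULL;
        z ^= z >> 31;
        y[j] = (uint32_t)(z % p->M);
    }
}

/* Encrypt x in {0..N-1}: r rounds over {0..M-1}, repeated while x >= N. */
static uint32_t fastperm2_encrypt(const fastperm2_params *p, const uint32_t *y, uint32_t x)
{
    assert(x < p->N);
    do {
        for (unsigned j = 0; j < p->r; j++) {
            /* 64-bit product keeps the multiplication and reduction exact */
            x = (uint32_t)(((x ^ x >> p->s) * (uint64_t)p->C + y[j]) % p->M);
        }
    } while (x >= p->N);
    return x;
}
```

To obtain $J$ permutations, one option consistent with the text is to derive a distinct seed per permutation index (for instance a fixed random constant XORed with the index) and call `fastperm2_subkeys` once per index, then evaluate `fastperm2_encrypt` only on the $K$ entries needed in each iteration.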

For a statistical application, the number of rounds $r$ can be lowered; I conjecture that $r\gets2\max(\lceil40/(n-1)\rceil+1,5)$ would pass any randomness test not constructed with knowledge of the cipher's structure, including any pre-existing test.

Revisions: Two awful typos in the round function have been fixed. I am back to considering that $n-1$ bits of entropy are injected per round (rather than $k$ in the short-lived fastperm3). Upped the entropy injected in the reduced version for statistical use, in order to lower the odds of near-identical permutations.

score 2 · Answer 3 · answered Nov 05 '14 at 15:45

The theoretically correct random permutation algorithm is, IMHO, uniquely that of Fisher and Yates [1], which needs n-1 PRNs to permute a sequence of n items. I previously found that a practically fairly passable result could also be achieved with 2 PRNs [2].

[1] D. E. Knuth, The Art of Computer Programming, Vol. 2, 3rd ed., p. 145.
[2] http://s13.zetaboards.com/Crypto/topic/7071388/1/
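For reference, a minimal sketch of the Fisher-Yates shuffle in C, showing that exactly n-1 pseudo-random numbers are drawn for n items. The generator `next_u64` is a hypothetical placeholder (a plain xorshift64 step), and reducing its output with `%` introduces a slight modulo bias that a careful implementation would remove.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder PRNG for the sketch: one xorshift64 step (state must be nonzero). */
static uint64_t next_u64(uint64_t *state)
{
    uint64_t x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    return *state = x;
}

/* Shuffle a[0..n-1] in place; returns the number of PRNs consumed (n-1). */
static size_t fisher_yates(uint32_t *a, size_t n, uint64_t *state)
{
    assert(*state != 0);
    size_t draws = 0;
    for (size_t i = n; i > 1; i--) {
        size_t j = (size_t)(next_u64(state) % i);   /* 0 <= j < i */
        uint32_t t = a[i - 1];                      /* swap a[i-1] and a[j] */
        a[i - 1] = a[j];
        a[j] = t;
        draws++;
    }
    return draws;
}
```

Storing J such shuffled arrays is what leads to the N×J memory footprint mentioned in the question.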
