How to convert a set of sequential integers into a set of unique random numbers?

Question

I have a set of sequential and non-duplicate integers which I would like to convert into a set of non-duplicate random integers.

What I want to achieve is this - I have a list of sequential numbers like below

[1, 2, 3, 4, 5, 6, 7, 8, 9]

In reality this list is going to be quite big, e.g. 100 million elements.

I want to shuffle this list multiple times in order to rearrange the numbers in the list. Each shuffle should give me a new sequence in such a way that, if I pick a number at a particular position in the list after each shuffle, I should never get a duplicate.

For example, if the above list is shuffled three times, I get the following three sequences after each shuffle

shuffle 1 --> [9, 3, 4, 1, 7, 2, 6, 5, 8]
shuffle 2 --> [8, 2, 7, 5, 4, 6, 3, 9, 1]
shuffle 3 --> [1, 9, 6, 7, 5, 3, 4, 8, 2]

In the above lists, if you pick a number at any position, say 2, from all the sets, you never get any duplicates. You will get a duplicate if you pick position 2 from the first sequence and position 5 from the third set and this is fine.

For some reason, I feel that there should be a cryptographic function for this kind of shuffling to which I can pass my input sequence and I get a shuffled sequence back as described above. Is there any? I have looked at LFSR but just not sure if that solution works in this case.

score 9 · Answer 1 · edited Apr 13 '17 at 12:48

Let $P$ be an arbitrary random fixed permutation of $n$ elements. Each of the desired shuffles can be generated from the previous one as follows:

apply $P$ to the previous shuffle (or the initial set/vector, the first time);
rotate the result one to the left;
apply $P^{-1}$ to the result.

This will have the property asked for by "if I pick a number at a particular position in the list after each shuffle, I should never get a duplicate". Proof: when the shuffle is applied repeatedly, the next application of $P$ will cancel the previous application of $P^{-1}$. It follows that each rotation operates on the result of the previous rotation, and thus that the property is met at the output of the rotation. It follows that the property is met after applying $P^{-1}$, thus at the output.

We can iterate this $n-1$ times (if we do it once more, we are back to the original, thus will get something clearly sequential, thus in the context distinguishable from the required "random integers" with great ease and near certainty).
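A minimal Python sketch of the iterated construction, under one reading of the steps: $P$ is taken to act on values, and "rotate one to the left" is taken to mean adding 1 modulo $n$, so that $k$ iterations give $P^{-1}((P(x)+k) \bmod n)$. The names `make_perm` and `shuffle_k` are illustrative, and the table-based $P$ only suits small $n$.

```python
import random

def make_perm(n, rng):
    """Return a random permutation table p and its inverse table."""
    p = list(range(n))
    rng.shuffle(p)                      # table-based P, fine for small n
    inv = [0] * n
    for i, v in enumerate(p):
        inv[v] = i
    return p, inv

def shuffle_k(n, k, p, inv):
    """k-th shuffle of [0, ..., n-1]: apply P, rotate by k, apply P^-1."""
    return [inv[(p[x] + k) % n] for x in range(n)]

rng = random.Random(2016)               # fixed seed, for repeatability only
n = 9
p, inv = make_perm(n, rng)
shuffles = [shuffle_k(n, k, p, inv) for k in range(1, n)]   # n-1 usable shuffles
for s in shuffles[:3]:
    print([v + 1 for v in s])           # shift to the 1..9 range of the question
```

With $k = 0$ the function returns the original sequence, which matches the remark that one more iteration brings us back to the start.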

We can use the Fisher-Yates shuffle to generate $P$. Or, if keeping it in memory is an issue because of size, we can use a cipher to implement $P$, and its reverse; that's a standard trick of Format Preserving Encryption.

Any isolated shuffle is indistinguishable from a random shuffle, but even the first two shuffles generated are distinguishable from random shuffles with the required property (which perhaps is undesirable, if not clearly prohibited).

If we remove the requirement that the same operation is iterated to go from one shuffle to the next (we are told that we can), we can do something better: pick three arbitrary random permutations $P$, $Q$, $R$ of $n$ elements, and get the $(j+1)^\text{th}$ shuffle as

apply $P$ to the initial state
rotate the result $Q(j)$ times to the left, where $Q(j)$ is the $(j+1)^\text{th}$ element of the vector obtained by applying $Q$ to the vector of the first $n$ non-negative integers; or, otherwise said, $Q(j)$ is an integer $q_j$ with $0\le q_j<n$, such that $0\le i<j<n\implies q_i\ne q_j$.
apply $R$ to the result.

This time, we can obtain $n$ individually random-looking shuffles, which obviously is the maximum. Again, Fisher-Yates or FPE can be used to implement $P$, $Q$ and $R$. Even knowing the initial shuffle and the rank of any two generated shuffles, I vaguely conjecture (without proof) that they are indistinguishable from two random shuffles meeting the property asked. By a counting argument, three generated shuffles are distinguishable, at least by an unbounded adversary; thus we are far from generating the best randomness possible.

I have asked the question in more academic terms there.

Here is sample code in Java for the second method. Class MyCipher implements a Pseudo Random Permutation of size elements, made using cycling (arguably the simplest efficient FPE technique), and a basic iterated block cipher. Class MyShuffledArray is then a straightforward implementation of the technique that I propose. The code is usable for large parameters, and uses constant memory.

import java.util.Random;
import java.security.SecureRandom;

public class MyShuffledArrayDemo {

    // MyCipher implements a PRP of size elements
    static class MyCipher {
        private static final int R = 40;        // number of rounds
        private final int size;                 // size of PRP
        private final int mask;                 // bit mask for block
        // 0 < size <= mask+1  and mask+1 is a power of 2
        private final int rish;                 // right shift count
        private final int[] rk = new int[R];    // rk for each round

        // Constructor from rng source
        private MyCipher(int size, Random rng) {
            assert size>0 : "MyCipher.size must be positive";
            int i,j;
            // find block cipher width j in bits; and mask 
            for (j = 3, i = 8; j<31 && i<size; ++j)
                i += i;
            this.size = size;
            this.mask = i-1;    // one less than a power of two
            this.rish = j*3/7;  // shift count, at least 1
            i = R; do
                rk[--i] = rng.nextInt();
            while (i!=0);
        }

        // Implement PRP, using a basic iterated block cipher and cycling.
        // Input and output are a non-negative integer less than size.
        private int Perm(int x) {
            assert x>=0 && x<this.size : "bad input to MyCipher.Perm";
            do { // cycling loop; executed on average less than twice when size>4
                int r = R;
                do {// Round loop; each losslessly transforms x by
                    // - multiplying by 0xADB modulo a power of 2
                    // - adding a round key
                    // - XORing x with a right-shifted version
                    // Here, x<= mask  and mask+1 is a power of 2
                    x = (x*0xADB+this.rk[--r]) & this.mask;
                    x ^= x>>>this.rish;
                    }
                while (r!=0);
                }
            while (x>=this.size);
            return x;
        }

    } // class MyCipher

    // MyShuffledArray implements a virtual square array where each line and column appears to be a random permutation
    static class MyShuffledArray {
        private final int size;         // dimension
        private MyCipher P,Q,R;

        // Constructor from rng source
        private MyShuffledArray(int size, Random rng) {
            assert size>0 && size<=0x40000000: "MyShuffledArray.size must be positive and at most 1073741824";
            this.size = size;
            P = new MyCipher(size, rng);
            Q = new MyCipher(size, rng);
            R = new MyCipher(size, rng);
            }

        // Implement the virtual array
        private int Get(int col, int lin) {
            return R.Perm((P.Perm(col)+Q.Perm(lin))%this.size);
            }

        } // class MyShuffledArray

    // Example use
    final static int START_OF_RANGE  = 1;
    final static int END_OF_RANGE    = 9;
    final static int NUMBER_OF_LISTS = 3;
    final static int size = END_OF_RANGE - START_OF_RANGE + 1;

    public static void main(String[] args) {
        MyShuffledArray myShuffledArray = new MyShuffledArray(size, new SecureRandom());
        for (int j = 0; j <NUMBER_OF_LISTS; j++) {
            for (int i = 0; i <size; i++)
                System.out.print(String.format(i==0?"[%d":", %d", myShuffledArray.Get(i,j) + START_OF_RANGE));
            System.out.println("]");
        }
    }
}
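The cycling idea used by MyCipher can be sketched in a few lines of Python. This is not a port of the Java class: the round function has the same shape (multiply by an odd constant, add a round key, XOR with a right shift), but the round count, the key generation, and the names `make_prp` and `perm` are assumptions chosen for illustration, with no claim of cryptographic strength.

```python
import random

def make_prp(size, rng, rounds=8):
    """Build a toy permutation of range(size) by cycle-walking."""
    bits = max(3, (size - 1).bit_length())   # block width in bits
    mask = (1 << bits) - 1                   # block domain is 0..mask
    shift = max(1, bits * 3 // 7)
    keys = [rng.getrandbits(bits) for _ in range(rounds)]

    def perm(x):
        while True:
            for k in keys:
                # each step is invertible modulo 2**bits
                x = (x * 0xADB + k) & mask   # odd multiplier, then add key
                x ^= x >> shift              # xorshift is a bijection
            if x < size:                     # walk until back inside the range
                return x
    return perm

perm = make_prp(1000, random.Random(1))
print(sorted(perm(x) for x in range(1000)) == list(range(1000)))
```

Because the inner rounds permute the whole power-of-two block, re-encrypting any out-of-range result must eventually land back inside `range(size)`, which is why the outer loop terminates.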

This answer identifies that we are generating a latin square. Here is a literature survey.

score 4 · Answer 2 · answered Oct 14 '16 at 20:13

Let's rephrase what you want. You want a set of sequences with two properties.

1) For any element of the set, that sequence contains no more than one of each possible element.

2) In any two elements a and b of the set, the nth element of sequence a does not equal the nth element of sequence b.

Instead of imagining a set of sequences, imagine a square matrix. In any given column you expect no element to appear twice. In any row you also expect no element to appear twice. (Sort of like a Sudoku table.) In other words, any given row is a permutation and any given column is a permutation. So if you have an N by N table with values 1 through N you can extract the nth shuffled sequence by looking at the nth row of numbers.
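The two properties can be checked mechanically. Here is a small Python sketch (the function name `is_latin_rectangle` is illustrative) applied to the three shuffles from the question:

```python
def is_latin_rectangle(rows):
    """True if no row repeats a value and no column repeats a value."""
    if not rows:
        return True
    width = len(rows[0])
    for row in rows:
        if len(row) != width or len(set(row)) != width:
            return False                 # ragged row or duplicate within a row
    for c in range(width):
        column = [row[c] for row in rows]
        if len(set(column)) != len(column):
            return False                 # same value twice at one position
    return True

shuffles = [
    [9, 3, 4, 1, 7, 2, 6, 5, 8],
    [8, 2, 7, 5, 4, 6, 3, 9, 1],
    [1, 9, 6, 7, 5, 3, 4, 8, 2],
]
print(is_latin_rectangle(shuffles))
```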

You can apply a pseudorandom function to a sequence of unique numbers. If that function is injective then by definition the sequence of outputs for that function will not have duplicates. For simplicity suppose you have a whole bunch of bijective pseudorandom functions available to you. Bijective functions are injective, so for any bijective function F whose domain and codomain are the integers 1 through N, the sequence [F(1), F(2), F(3), ... F(N)] is a permutation of the values 1 through N.

You can apply another bijective function to permute your first permutation, resulting in a permutation again. This is because the composition of bijective functions is a bijective function. This is how Even-Mansour encryption works. Simple things like XOR and modular addition are bijective but don't appear random. An XOR-Encrypt-XOR operation chain is used to create block ciphers from unkeyed permutations or tweakable block ciphers from a normal block cipher.

Subkeys of an EM cipher are the same size as the block size. Since the subkeys are combined at the input stage by xor, this makes the input and first subkey in a sense interchangeable, because xor is a commutative operation.

So instead of using these subkeys to create a secret permutation, let's suppose we don't care if our shuffle function is cryptographically secure. Then let's use a two stage EM cipher with the first two subkeys being our x and y values, used as tweak values, and a constant, say zero, for cipher input. $F(x, y) = P(P(x \oplus 0) \oplus y) \oplus constk$. If we leave x constant we can show that $F_x(y)$ is bijective. If we leave y constant and vary x instead we can similarly show that $F_y(x)$ is bijective.
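As a rough illustration of the formula, here is a Python sketch in which $P$ is a lookup table and XOR is replaced by addition modulo n, so that n need not be a power of two. That substitution, the table-based $P$, and the names `make_table` and `make_f` are assumptions for the sake of the example; nothing here is cryptographically secure.

```python
import random

def make_table(n, rng):
    """A random permutation of range(n), stored as a lookup table."""
    table = list(range(n))
    rng.shuffle(table)
    return table

def make_f(n, rng, const_k=0):
    """F(x, y) = P(P(x) + y) + k, with all additions taken modulo n."""
    p = make_table(n, rng)
    def f(x, y):
        return (p[(p[x] + y) % n] + const_k) % n
    return f

n = 10
f = make_f(n, random.Random(7))
square = [[f(x, y) for x in range(n)] for y in range(n)]   # row y = y-th shuffle
for row in square[:3]:
    print(row)
```

Each step applied to x (table lookup, adding y, table lookup, adding the constant) is a bijection for fixed y, and the same holds for y with x fixed, which is what makes every row and every column a permutation.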

Here is a pastebin showing the same idea. I use look up tables to create small bijective functions. Two permutation rounds are sufficient to get the properties of uniqueness you want, but the results wouldn't look random. I add the third lookup table round to "reshuffle" the non-random permutation that is the result of the first two rounds. This demonstration is not cryptographically secure, but if this was based on a real EM cipher and a couple extra rounds and secret keys were added to the end of the algorithm it could be. I use modular arithmetic instead of xor because it has the same properties but also works for a non-power of two.

What is interesting about this method is that you can access the xth element of the yth shuffle in O(1) time. The downside is that you need a bijective pseudorandom function for whatever domain size you choose. It is fairly easy to select an easily implementable function if the size is a power of two or if the size is small enough to implement using a lookup table. For unknown arbitrary sizes you would need to choose a good bijective function for that domain size. If you choose a larger block size than you really need, then you can truncate the matrix to the first n rows and first n columns. Not every element of the codomain is represented in the sub-matrix and not every row will contain the same elements. (Each row would be a random subset of the codomain.)

Though you asked it on crypto, I hope you're not looking to implement something cryptographically secure. If a cryptographically secure method of generating these sequences didn't occur to you easily, then you don't have the prerequisite knowledge to know if you implemented it right. If you need a specialized algorithm for an application that doesn't require security, like a video game, simulation, or odd non-cryptographic hash function, then you could use my demo to build something. For a non-crypto application I would recommend using Murmur Hash 3's 64 bit variant's finalization routine as your public-knowledge pseudo-random permutation instead of a lookup table. (You can then use xor instead of modular addition.)
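A sketch of that suggestion in Python, using the 64-bit finalizer of MurmurHash3 as the public permutation and XOR as the mixing step. The function name `element` and the two-round layout are assumptions that follow the formula given earlier; the domain here is all 64-bit integers, so values are not confined to 1 through N.

```python
MASK64 = (1 << 64) - 1

def fmix64(k):
    """MurmurHash3 64-bit finalizer; a bijection on 64-bit integers."""
    k ^= k >> 33
    k = (k * 0xFF51AFD7ED558CCD) & MASK64   # odd multiplier, invertible mod 2**64
    k ^= k >> 33
    k = (k * 0xC4CEB9FE1A85EC53) & MASK64
    k ^= k >> 33
    return k

def element(x, y):
    """Element x of shuffle y over the full 64-bit domain (not secure)."""
    return fmix64(fmix64(x) ^ y)

print([element(x, 0) for x in range(3)])
print([element(x, 1) for x in range(3)])
```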

Mok-Kong Shen · Answer 3 · 2016-10-15T08:43:52.593

The algorithm of Fisher and Yates (see Knuth, The Art of Computer Programming, vol. 2) does a pseudo-random permutation of a sequence without however avoiding possible duplicates in your sense. One can insert a test in the algorithm to avoid duplicates so that the end result will be one without duplicates with respect to the original set.

[Added on edit:] One can do somewhat better, i.e. without testing, by modifying the range of the PRNs used in the algorithm. Below is Python code that pseudo-randomly permutes a list lista (assumed to have no duplicated elements and length >= 2) to become a list listb such that your requirements are satisfied. Note that we have employed the list indexing convention of Python, which is different from that in Knuth p.145.

import random
def specialrandompermutation(lista):
  lenlist=len(lista)
  listb=lista[:]
  j=lenlist-1
  while j > 0:
    k=random.randint(0,j-1)
    temp=listb[j]
    listb[j]=listb[k]
    listb[k]=temp
    j=j-1
  return(listb)

A test example of the code is:

mylist=[0,1,2,3,4,5,6,7,8,9]
for i in range(5):
  newlist=specialrandompermutation(mylist)
  print(newlist)

[Added on 2nd edit:] On re-reading your OP I must remark that my code only ensures that the input lista and the output listb don't have the same elements at any positions. However, if a 2nd invocation of the code with the same input lista leads to the output listb2nd, listb and listb2nd may well have the same elements at certain positions. In fact listb and listb2nd have a certain finite probability of being equal to each other.

The original algorithm of Fisher and Yates would have the statement k=random.randint(0,j) instead of k=random.randint(0,j-1) of our code above.
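The modified range makes this the variant usually known as Sattolo's algorithm, which always produces a single-cycle permutation and therefore leaves no element in its original position. A small check of that property, restated compactly so that it runs on its own (the name `cyclic_shuffle` is illustrative):

```python
import random

def cyclic_shuffle(items, rng=random):
    """Fisher-Yates with k drawn from 0..j-1: no element keeps its position."""
    out = list(items)
    for j in range(len(out) - 1, 0, -1):
        k = rng.randrange(j)            # 0 <= k < j, never j itself
        out[j], out[k] = out[k], out[j]
    return out

original = list(range(10))
for _ in range(1000):
    shuffled = cyclic_shuffle(original)
    # every position differs from the input, on every run
    assert all(a != b for a, b in zip(original, shuffled))
print(cyclic_shuffle(original))
```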

score 2 · Answer 4 · edited Apr 13 '17 at 12:48

The simplest but also the most dangerous thing to do is to shuffle all the lists sequentially, checking whether the previously shuffled lists contain an identical value at the same position. If they do, you can simply re-shuffle.

The problem with this is of course that this is a non-deterministic scheme: it is not known in advance how many shuffles are required. There is no fixed bound on when it ends, although the likelihood that it has ended increases with the number of tries. That is, if a solution is still possible (if your set contains 2 elements, only two shuffles can comply with your scheme).

If the size of the set isn't sufficiently larger than the required number of lists, however, the running time will very quickly increase to infeasible numbers. In other words, the big-O of this solution seems to be of order $n!$ or something similar.

Your comment below the question indicates very large lists and a large number of lists, for which this solution is entirely infeasible.

Mok-Kong Shen proposes a specific algorithm to perform the shuffling. Of course any cryptographically secure shuffle can be used, although the Fisher and Yates shuffle does make sense.

Here's some code to run, just for the fun of it. It's Java 8, without additional libraries. I strongly suggest playing around with the constant values to get a feel for the running time. It hides the (seeded) PRNG and shuffle implementation, but as long as those are OK, it doesn't matter.

import java.security.SecureRandom;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.logging.Logger;

public class NonDeterministicShuffle {

    private static final int START_OF_RANGE = 1;
    private static final int END_OF_RANGE = 10;
    private static final int AMOUNT_OF_LISTS = 3;

    public static void main(String[] args) {

        // --- setup ---
        final ArrayList<Integer> list = new ArrayList<>();
        for (int i = START_OF_RANGE; i <= END_OF_RANGE; i++) {
            list.add(i);
        }

        final ArrayList<ArrayList<Integer>> lists =
                new ArrayList<>();
        lists.add(list);
        for (int i = 1; i < AMOUNT_OF_LISTS; i++) {
            @SuppressWarnings("unchecked")
            final ArrayList<Integer> clone =
                    (ArrayList<Integer>) list.clone();
            lists.add(clone);
        }

        shuffleToUniqueValueAtAllIndices(lists);

        // --- print out results ---
        for (final ArrayList<Integer> toPrint : lists) {
            System.out.println(toPrint);
        }
    }

    private static void shuffleToUniqueValueAtAllIndices(
            final ArrayList<ArrayList<Integer>> lists) {
        // --- create any cryptographically secure random ---
        final SecureRandom rng = new SecureRandom();

        int shuffles = 0;

        for (int li = 0; li < lists.size(); li++) {
            final ArrayList<Integer> shuffleList =
                    lists.get(li);
            final List<ArrayList<Integer>> shuffledLists =
                    lists.subList(0, li);

            do {
                shuffles++;
                // this performs the Fisher-Yates shuffle
                Collections.shuffle(shuffleList, rng);
            } while (!hasUniqueValueAtAllIndices(
                    shuffleList, shuffledLists));

        }

        Logger.getGlobal().info(String.format("Shuffles: %d%n", shuffles));
    }

    private static boolean hasUniqueValueAtAllIndices(
            final ArrayList<Integer> list,
            final List<ArrayList<Integer>> shuffledLists) {
        for (final ArrayList<Integer> toCheck : shuffledLists) {
            for (int i = 0; i < list.size(); i++) {
                // compare values, not Integer object references
                if (toCheck.get(i).equals(list.get(i))) {
                    return false;
                }
            }
        }
        return true;
    }
}

Output for 1 to 1,000,000 range and 8 lists:

Oct 14, 2016 3:03:49 PM nl.owlstead.crypto.NonDeterministicShuffle shuffleToUniqueValueAtAllIndices
INFO: List: 0

Oct 14, 2016 3:03:49 PM nl.owlstead.crypto.NonDeterministicShuffle shuffleToUniqueValueAtAllIndices
INFO: List: 1

Oct 14, 2016 3:03:49 PM nl.owlstead.crypto.NonDeterministicShuffle shuffleToUniqueValueAtAllIndices
INFO: List: 2

Oct 14, 2016 3:03:51 PM nl.owlstead.crypto.NonDeterministicShuffle shuffleToUniqueValueAtAllIndices
INFO: List: 3

Oct 14, 2016 3:03:58 PM nl.owlstead.crypto.NonDeterministicShuffle shuffleToUniqueValueAtAllIndices
INFO: List: 4

Oct 14, 2016 3:04:00 PM nl.owlstead.crypto.NonDeterministicShuffle shuffleToUniqueValueAtAllIndices
INFO: List: 5

Oct 14, 2016 3:04:23 PM nl.owlstead.crypto.NonDeterministicShuffle shuffleToUniqueValueAtAllIndices
INFO: List: 6

Oct 14, 2016 3:04:30 PM nl.owlstead.crypto.NonDeterministicShuffle shuffleToUniqueValueAtAllIndices
INFO: List: 7

Oct 14, 2016 3:07:32 PM nl.owlstead.crypto.NonDeterministicShuffle shuffleToUniqueValueAtAllIndices
INFO: Shuffles: 1257

So the first 3 lists happen in the first 2 seconds. The last list (index 7) takes more than 3 full minutes and the algorithm takes 1257 shuffles (!!!). So while this is a feasible solution for limited lists (both in size and in quantity), you may want to consider other, more efficient algorithms.
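A rough estimate helps put the count in context. It assumes the standard asymptotic result that, for large lists, a uniformly random shuffle avoids k earlier mutually compatible lists at every position with probability of about $e^{-k}$ on average, so that list k needs about $e^k$ attempts. The function name below is illustrative, and this is a back-of-the-envelope sketch, not a measurement.

```python
import math

def expected_shuffles(num_lists):
    """Approximate expected attempts: sum of e**k for k = 0..num_lists-1."""
    return sum(math.exp(k) for k in range(num_lists))

for m in (3, 8, 16):
    print(m, round(expected_shuffles(m)))
```

For 8 lists the estimate is in the low thousands, the same order of magnitude as the run above. Under this assumption the cost grows exponentially with the number of lists, while each attempt costs time proportional to the list length.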

score 2 · Answer 5 · answered Oct 17 '16 at 15:11

It sounds like you are trying to build a random Latin square -- an n x n array filled with n unique symbols, such that each symbol occurs exactly once in each row and once in each column. (A Sudoku puzzle is a kind of Latin square with a few additional restrictions).

The addition table for the integers modulo n is one (not random) Latin square.

The balanced Latin square for n integers is another (not random) Latin square.

In "The Latin Square Design", Glenn Johnson suggests:

The ideal randomization would be to select a square from the set of all possible Latin squares of the specified size. However, a more practical randomization scheme would be to select a standardized Latin square at random (these are tabulated) and then:

randomly permute the columns,

randomly permute the rows, and then

assign the treatments to the Latin letters in a random fashion.
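A minimal Python sketch of that randomization scheme. As an assumption, it starts from the addition table modulo n mentioned above instead of a tabulated standardized square, and the name `random_latin_square` is illustrative. The result is a valid Latin square, but it is not drawn uniformly from all Latin squares of that size.

```python
import random

def random_latin_square(n, rng):
    """Cyclic square (r + c) mod n with rows, columns and symbols permuted."""
    rows = list(range(n))
    cols = list(range(n))
    syms = list(range(n))
    rng.shuffle(rows)       # randomly permute the rows
    rng.shuffle(cols)       # randomly permute the columns
    rng.shuffle(syms)       # randomly relabel the symbols
    return [[syms[(r + c) % n] for c in cols] for r in rows]

square = random_latin_square(9, random.Random(3))
for row in square[:3]:
    print([v + 1 for v in row])   # shift to 1..9 as in the question
```

Storing the full square takes memory proportional to n squared, so for very large n one would compute single entries from the three permutations on demand instead.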

Have you looked at the algorithm of Jacobson and Matthews?

Jacobson, M. T.; Matthews, P. (1996). "Generating uniformly distributed random latin squares". Journal of Combinatorial Designs. 4 (6): 405–437. doi:10.1002/(sici)1520-6610(1996)4:6<405::aid-jcd3>3.0.co;2-j

score 1 · Answer 6 · answered Oct 14 '16 at 19:02

So here's a very simple algorithm that would work in your case:

Divide the set of sequential integers into blocks equal to the maximum number of shuffles. For example, for a maximum of 3 shuffles and 9 integers: [1, 2, 3], [4, 5, 6], [7, 8, 9].
For each "shuffle", use a cryptographic random number generator to mix up the numbers inside the blocks, e.g.: [3, 1, 2], [4, 6, 5], [9, 8, 7].
Then use a fixed pattern to swap the blocks, e.g. 3->1, 1->2, 2->3 would generate [9, 8, 7], [3, 1, 2], [4, 6, 5].
The final result after the first iteration is [9, 8, 7, 3, 1, 2, 4, 6, 5].

If the number of integers per block is sufficiently large (e.g. 128 per block) then it is impossible for an attacker to predict their order. And the moving of the blocks prevents elements from appearing at the same location.
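The steps above can be sketched in Python as follows. The sketch assumes that the number of blocks equals the maximum number of shuffles and that the list length divides evenly; the name `block_shuffles` is illustrative, and `random.SystemRandom` stands in for the cryptographic random number generator.

```python
import random

def block_shuffles(values, num_blocks, rng):
    """Yield num_blocks shuffles; each block sits in a new slot every time."""
    size = len(values) // num_blocks
    assert size * num_blocks == len(values), "length must divide evenly"
    blocks = [list(values[i * size:(i + 1) * size]) for i in range(num_blocks)]
    for _ in range(num_blocks):
        for block in blocks:
            rng.shuffle(block)                # mix inside each block
        blocks = blocks[-1:] + blocks[:-1]    # fixed pattern: rotate blocks by one
        yield [v for block in blocks for v in block]

rng = random.SystemRandom()                   # OS-backed randomness
shuffles = list(block_shuffles(list(range(1, 10)), 3, rng))
for s in shuffles:
    print(s)
```

In this sketch the last of the shuffles puts every block back in its starting slot, so it can agree with the original unshuffled list at some positions even though the shuffles never agree with each other.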

The only weakness to this approach is that numbers in the same block will always be near to each other (e.g. in this example an attacker would know that a 9 would be 1 or 2 elements away from an 8).

To combat this we could add an additional step of applying an arbitrary random fixed permutation to the final result, and inverting that permutation before the next shuffle. (Similar to the approach of @fgrieu).

Baumflaum · Answer 7 · 2016-10-14T15:21:35.160

What you request is basically the construction of Finite or Galois Fields.

It constructs a set number (n) of elements in which there will never be a duplicate in a column or row, provided your constructing polynomial is irreducible. Basically there is only a set number of possible combinations, and that number is (correct me if I'm wrong) equal to the number of irreducible polynomials of degree n.

You don't have to calculate an entire field. You just have to choose a random polynomial, multiply your column number by your row number, and reduce the product modulo your irreducible polynomial. Needless to say, your numbers are represented by polynomials.
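A small Python sketch of that multiplication, using GF(2^4) with the irreducible polynomial $x^4 + x + 1$ as an assumed example (the name `gf_mul` and the parameters are illustrative). Only the nonzero elements are used, since multiplying by zero always gives zero, so this particular sketch yields a table of size $2^m - 1$, not an arbitrary size.

```python
def gf_mul(a, b, poly=0b10011, degree=4):
    """Multiply a and b in GF(2**degree) modulo the given polynomial."""
    result = 0
    while b:
        if b & 1:
            result ^= a          # add (XOR) the current multiple of a
        b >>= 1
        a <<= 1                  # multiply a by x
        if a >> degree:          # reduce when the degree overflows
            a ^= poly
    return result

n = 15                           # nonzero elements of GF(16)
table = [[gf_mul(r, c) for c in range(1, n + 1)] for r in range(1, n + 1)]
for row in table[:3]:
    print(row)
```

Each entry is computed on demand from its row and column number, so no part of the table has to be stored.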

Edit: Since you want to compute 100 million elements in a row, this or something like it is your only way out. Otherwise you would need 10 petabytes * variable size to store it.

score 0 · Answer 8 · answered Oct 14 '16 at 16:59

A block cipher is a function from keys to (cryptographically-indistinguishable-from) random permutations of the set of all numbers up to $2^{\text{block size}}$. Therefore:

Choose a block cipher whose block size is bigger than your largest input value will ever be. Choose a random key for that block cipher. Encrypt each integer in your input sequence as a single block; interpret each output block as an integer; the sequence of output integers is the sequence you want. Change the key to generate a new sequence.

This will be fast and secure. However, you will almost surely get output values that are bigger than the largest input value, unless your upper limit is in fact equal to $2^{\text{block size}}$. It is not clear to me whether this is a problem.
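To make the idea concrete without depending on a cipher library, here is a Python sketch that uses a toy Feistel network built on SHA-256 as a stand-in for the block cipher. The construction, the 12-bit block size, and the name `feistel_encrypt` are assumptions for illustration only; a real deployment would use a vetted block cipher.

```python
import hashlib

def feistel_encrypt(key, x, half_bits=6, rounds=4):
    """Toy Feistel permutation of 2*half_bits-bit integers (not a vetted cipher)."""
    mask = (1 << half_bits) - 1
    left, right = x >> half_bits, x & mask
    for r in range(rounds):
        # round function: hash of key, round number and the right half
        digest = hashlib.sha256(key + bytes([r]) + right.to_bytes(4, "big")).digest()
        f = int.from_bytes(digest[:4], "big") & mask
        left, right = right, left ^ f
    return (left << half_bits) | right

inputs = list(range(1, 10))
for key in (b"key-1", b"key-2"):
    print([feistel_encrypt(key, x) for x in inputs])
```

The outputs range over the whole 12-bit block, so most of them exceed the largest input, as noted above. Nothing in this sketch prevents two different keys from producing the same value at the same position.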
