
If we sample $N \gg n$ binary strings of length $n$, uniformly and independently, what is the expected minimum Hamming distance between the closest pair?

It seems likely that this is hard to compute exactly, so a good approximation would be very welcome.

This related question asks specifically about the case where $N = 3$. In a very nice answer, @joriki sets out that the mean minimum Hamming distance in this case is approximately:

$$ \boxed{\frac{n}{2}-\frac{3}{4}\sqrt{\frac{n}{\pi}}}\;. $$
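For example, with $n=50$ this evaluates to $25-\tfrac{3}{4}\sqrt{50/\pi}\approx 22.0$.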

In order to have concrete values to aim at (a simulation sketch appears after the comments below):

  • If $N=2^{12}$ and $n=50$ then the mean minimum Hamming distance is approximately $7.3$.
  • If $N=2^{12}$ and $n=64$ then the mean minimum Hamming distance is approximately $11.7$.
  • If $N=2^{14}$ and $n=50$ then the mean minimum Hamming distance is approximately $5.9$.
  • If $N=2^{14}$ and $n=150$ then the mean minimum Hamming distance is approximately $40.2$.
  • Yes -- for clarity, I deleted and rewrote my earlier comment: there are $r$ strings. What you have is $m:=\binom{r}{2}$ integer-valued r.v.'s (or equivalently dependent vectors, each with Bernoulli components that we sum over), say $X_1,X_2,\ldots,X_m$, and you want the expected value of the minimum. You should realize it's sufficient to estimate the maximum, and then use $E[\max_i X_i]\leq c+m\cdot \int_c^\infty P(X>x)\,dx$. Any $c \geq 0$ will work, but this is optimized with $c^*$ such that $m\cdot P(X>c^*)=1$. The (non-optimized but nearly tight) bound is easy for $r=3$; messier for $r>3$. – user8675309 Jan 27 '20 at 21:24
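As a sanity check on the concrete target values above, here is a minimal Monte Carlo sketch (my own, not from the thread); the function name `min_pairwise_hamming`, the seed, and the trial count are illustrative choices, and the dense $N\times N$ distance matrix is only practical up to roughly $N=2^{12}$ without blocking.

```python
# Minimal Monte Carlo sketch: estimate the mean minimum pairwise Hamming
# distance by direct simulation. The dense N x N distance matrix is fine for
# N = 2^12 but would need a blocked computation for much larger N.
import numpy as np

def min_pairwise_hamming(N, n, rng):
    """Sample N uniform bit strings of length n; return the smallest pairwise Hamming distance."""
    X = rng.integers(0, 2, size=(N, n)).astype(np.float32)
    # dist(i, j) = #{k : X_ik != X_jk} = X_i . (1 - X_j) + (1 - X_i) . X_j
    D = X @ (1.0 - X).T + (1.0 - X) @ X.T
    np.fill_diagonal(D, n + 1)  # exclude each string's zero distance to itself
    return int(D.min())

rng = np.random.default_rng(0)
N, n, trials = 2**12, 50, 5
mean_min = np.mean([min_pairwise_hamming(N, n, rng) for _ in range(trials)])
print(f"N={N}, n={n}: estimated mean minimum distance ~ {mean_min:.1f}")  # question's target: ~7.3
```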

1 Answer


Maybe this helps: from the additive form of the Chernoff bound, for any fixed pair of strings $X,Y\in \{0,1\}^n$ generated uniformly and independently, one has for any $t\geq 0$: \begin{equation} \Pr(\text{dist}(X,Y)\leq n/2 -t)\leq \exp\bigg(\frac{-2t^2}{n}\bigg). \end{equation} This is because, for a fixed pair, $\text{dist}(X,Y)$ is a sum of $n$ independent Bernoulli random variables, the $i$th indicating whether the $i$th bits of $X$ and $Y$ disagree. Applying this with $t=\sqrt{(3/2)n \ln(N/2^{1/3})}=\Theta(\sqrt{n\ln N})$ gives that the probability that a fixed pair deviates below $n/2$ by at least this $t$ is at most $2/N^3$ (I kept explicit constants in case you care about those rather than just the asymptotics). By a union bound over all ${N \choose 2}\leq N^2/2$ pairs, the probability that any pair deviates by this much is at most $1/N$.
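For concreteness (my own numeric check, not part of the argument), the threshold $t$ and the resulting union bound can be computed directly for the question's parameters:

```python
# Numeric check of the Chernoff/union-bound step (my own sketch).
import math

def chernoff_threshold(N, n):
    """t such that a fixed pair falls below n/2 - t with probability at most 2/N^3."""
    return math.sqrt(1.5 * n * math.log(N / 2**(1/3)))

N, n = 2**12, 50
t = chernoff_threshold(N, n)
per_pair = math.exp(-2 * t**2 / n)         # equals 2/N^3 for this choice of t
union = (N * (N - 1) / 2) * per_pair        # at most 1/N
print(f"t = {t:.2f}, per-pair bound = {per_pair:.3g}, union bound = {union:.3g}")
```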

Note that the maximum possible deviation from $n/2$ is exactly $n/2$ (if you get the same string twice). So, since distances are nonnegative, we can conclude that if $X_1,\ldots,X_N\in \{0,1\}^n$ are the sampled strings, \begin{align*} \mathbb{E}[\min_{i\neq j} \text{dist}(X_i,X_j)]&\geq (n/2-t)\Pr(\forall i\neq j: \text{dist}(X_i,X_j)-n/2\geq -t)\\ &\geq (n/2-\Theta(\sqrt{n\ln N}))(1-1/N)\\ &=n/2-\Theta(\sqrt{n\ln N})-o(1), \end{align*} where we use $N\gg n$ to absorb the small negative contribution of the discarded terms, which are at most $O(n/N)$. I leave it to you to keep track of the constants if you need them...
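Plugging the question's parameters into this lower bound (again my own sketch) shows that it is quite loose for small $n$, and can even become vacuous when $\ln N$ is large relative to $n$, while still having the right $n/2-\Theta(\sqrt{n\ln N})$ shape:

```python
# Evaluate the lower bound (n/2 - t)(1 - 1/N) for the question's parameter choices.
# Note it can go negative (i.e. become vacuous) when ln N is large relative to n.
import math

def lower_bound(N, n):
    t = math.sqrt(1.5 * n * math.log(N / 2**(1/3)))
    return (n / 2 - t) * (1 - 1 / N)

for N, n in [(2**12, 50), (2**12, 64), (2**14, 50), (2**14, 150)]:
    print(f"N=2^{int(math.log2(N))}, n={n}: lower bound ~ {lower_bound(N, n):.2f}")
```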

On the other hand, this is basically the right scale; here's a sketch. By the central limit theorem, \begin{equation} \Pr(\text{dist}(X,Y)\leq n/2 -t\sqrt{n})\sim \frac{1}{\sqrt{2\pi}}\int_t^{\infty} e^{-x^2/2}dx\sim e^{-t^2/2}. \end{equation} For $t\sim \sqrt{\ln N}$, this is approximately $1/\sqrt{N}$. Now, by splitting our $N$ random variables into $N/2$ disjoint pairs, we have \begin{align} \Pr(\exists i\neq j: \text{dist}(X_i,X_j)\leq n/2-t\sqrt{n})&\geq \Pr(\exists 1\leq k\leq N/2:\text{dist}(X_{2k-1},X_{2k})\leq n/2 -t\sqrt{n})\\ &=1-\Pr(\forall 1\leq k\leq N/2:\text{dist}(X_{2k-1},X_{2k})> n/2 -t\sqrt{n})\\ &= 1-\Pr(\text{dist}(X,Y)> n/2 -t\sqrt{n})^{N/2}, \end{align} noting that after splitting into disjoint pairs, these events are independent. In particular, this probability is lower bounded as \begin{equation} \Pr(\exists i\neq j: \text{dist}(X_i,X_j)\leq n/2-t\sqrt{n})\succeq 1-(1-1/\sqrt{N})^{N/2}\geq 1-\exp(-\sqrt{N}/2), \end{equation} where I am using $\succeq$ to denote a step that relies on an asymptotic approximation (i.e. the CLT). In particular, with overwhelming probability there exists a pair whose distance is at most $n/2-\Theta(\sqrt{n\ln N})$. I leave it to you to get tight constants and/or make this sketch fully rigorous.
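To illustrate the pairing step numerically, here is my own sketch (not from the answer) that plugs the binomial's actual standard deviation $\sqrt{n}/2$ into the normal approximation and tabulates $1-(1-p)^{N/2}$, the lower bound on the probability that some disjoint pair is at distance at most $d$. Since only the $N/2$ disjoint pairs are counted, these probabilities are conservative compared to the true minimum over all $\binom{N}{2}$ pairs.

```python
# Sketch of the pairing argument with a normal approximation to the pair distance
# (mean n/2, standard deviation sqrt(n)/2). Probabilities are conservative since
# only N/2 disjoint pairs are considered.
import math

def pairing_lower_bound(N, n, d):
    """Approximate P(some disjoint pair has distance <= d) via 1 - (1 - p)^(N/2)."""
    z = (n / 2 - d) / (math.sqrt(n) / 2)       # standardized deviation below the mean
    p = 0.5 * math.erfc(z / math.sqrt(2))      # normal approximation to P(dist <= d)
    return 1 - (1 - p) ** (N // 2)

N, n = 2**12, 50
for d in range(4, 13):
    print(f"d = {d:2d}: P(some disjoint pair <= d) >~ {pairing_lower_bound(N, n, d):.3f}")
```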

  • If I set $n=64$ and $N=2^{12}$ and compute $n/2 - \sqrt{(3/2)n \ln(N/2^{1/3})}$ I get approx $4.14$. This is quite a long way from $11.6$. Is it possible to get a closer estimate? –  Jan 27 '20 at 22:25
  • @fomin I mean, maybe? Computing the suprema of dependent random variables is analytically difficult, so you need to use union bounds/Chernoff bounds/CLTs etc. This will basically always lose some factors. This was to show you how the mean scales asymptotically with $n,N$. – Jason Gaitonde Jan 28 '20 at 15:08