10

Suppose we use a hash function $H$ to hash $N$ distinct balls into $M$ distinct bins. Assuming simple uniform hashing, what is the expected number of collisions?

Note that a collision is defined by adding a ball to an already occupied bin. If the already occupied bin has $k$ balls in it, then the number of collisions upon adding a new ball is $k.$


Using the definition of expectation, I tried the following:

1 × (probability of collision in the first insertion) + 2 × (probability of collision in the second insertion) + … + n × (probability of collision in the nth insertion)

$= (1 \cdot 0) + \left(2 \cdot \frac{1}{m}\right) + \left(3 \cdot \frac{2}{m}\right) + \left(4 \cdot \frac{3}{m}\right) + \cdots + \left(n \cdot \frac{n-1}{m}\right)$


The given answer is actually $\frac{n^2 - n}{2m}$.


But I am not getting that answer. Where am I going wrong?

cats
  • 4,408
Jon Garrick
  • 2,674
  • note that it should be $n\times P(\textrm{n collisions}),$ not collision on the $n$th insertion – cats Dec 09 '16 at 15:09
  • @cats Sorry, didn't get it. Could you please explain it as answer ? – Jon Garrick Dec 09 '16 at 15:10
  • I'm just saying that your definition of expectation is incorrect. You need $1\times P(\textrm{1 collision}) + 2\times P(\textrm{2 collisions}) + \ldots ,$ which is not the same as what you've written – cats Dec 09 '16 at 15:12
  • @cats okk !! When I write "2 × Probability of collision in second insertion" then it means that for 2 collisions to happen, what is the probability ? Similarly, for 3 collisions to happen what is the probability? But instead, If I write "1 × Probability of collision in second insertion" then it means probability of having one collision in second attempt. Similarly, what is the probability of having one collision in 3rd attempt ? Am I right now ? – Jon Garrick Dec 09 '16 at 15:26
  • No, $2\times P(\textrm{collision in second insertion})$ means $2$ times the probability that a collision occurs in the second insertion, which includes the case where no collision occurred in the first insertion, – cats Dec 09 '16 at 15:27
  • What do we mean by "collisions"? The number of bins with 2+ balls? If all balls end up in the same bin, how many collisions are there? – Kazz Dec 09 '16 at 15:29
  • 1
    Collisions probably means the number of times you put a ball into an already occupied bin – cats Dec 09 '16 at 15:30
  • If you say "you're not getting the answer", do you know what the answer is supposed to be? Would you share it with us? – Kazz Dec 09 '16 at 17:31
  • @Kazz Please check again. I have updated the question. – Jon Garrick Dec 09 '16 at 17:45
  • @cats If possible, could you please post an answer? It would be very helpful. – Jon Garrick Dec 09 '16 at 17:54
  • The answer you've given seems to say that a "collision" is actually counted differently than what we've said. Namely, that if there are $k$ balls already in a bin, then a new ball actually adds $k$ collisions – cats Dec 09 '16 at 20:04

4 Answers

5

Note that if you end with $k$ occupied bins, then there were $N-k$ collisions. In other words, we want $N$ minus the expected number of occupied bins. This is easy - the probability a bin is unoccupied at the end is $\left(\frac{M-1}{M}\right)^N,$ so the expected number of unoccupied bins is this times $M$ and the expected number of occupied bins is $M\left(1-\left(\frac{M-1}{M}\right)^N\right),$ so our answer is $N$ minus this.
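As a sanity check, here is a minimal sketch that compares the formula above against a brute-force average over all $M^N$ equally likely placements. The function names are hypothetical, and it assumes the "one collision per ball that lands in an occupied bin" definition used in this answer.

```python
from fractions import Fraction
from itertools import product


def expected_single_collisions_exact(n_balls, n_bins):
    """Average of (balls - occupied bins) over all equally likely placements."""
    total = 0
    for placement in product(range(n_bins), repeat=n_balls):
        # Each ball that does not open a new bin counts as one collision.
        total += n_balls - len(set(placement))
    return Fraction(total, n_bins ** n_balls)


def expected_single_collisions_formula(n_balls, n_bins):
    """N - M * (1 - ((M - 1) / M)^N), evaluated exactly."""
    occupied = n_bins * (1 - Fraction(n_bins - 1, n_bins) ** n_balls)
    return n_balls - occupied
```

Exact fractions are used so the two values can be compared with `==` rather than a floating-point tolerance; the enumeration is only practical for small $N$ and $M$.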

Edit: this answers a different version of the problem. See paw88789's answer for the updated version :)

cats
  • 4,408
  • I enjoy this succinct explanation for the more common problem. Ironically, I could not find as good an explanation for the more common problem. I think it is good to have answers to both questions on the same one – Charlie Tian Jan 30 '18 at 09:08
5

The expected number of new collisions caused at the time of inserting the $k$-th ball is $\frac{k-1}{M}$ since it has a $\frac1M$ collision probability with each ball already placed.

Thus the expected number of collisions is

$\frac0M+\frac1M+\frac2M+\cdots+\frac{N-1}{M}=\frac{N(N-1)}{2}\cdot\frac1M$
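For small cases this can be checked by brute force. The sketch below (function names are hypothetical) counts collisions as the question defines them, adding $k$ when a ball lands in a bin that already holds $k$ balls, and averages over all $M^N$ equally likely placements.

```python
from fractions import Fraction
from itertools import product


def count_collisions(placement, n_bins):
    """Insert balls in order; a ball landing on k balls adds k collisions."""
    counts = [0] * n_bins
    collisions = 0
    for bin_index in placement:
        collisions += counts[bin_index]
        counts[bin_index] += 1
    return collisions


def expected_collisions_exact(n_balls, n_bins):
    """Exact expectation under simple uniform hashing, by full enumeration."""
    total = sum(
        count_collisions(placement, n_bins)
        for placement in product(range(n_bins), repeat=n_balls)
    )
    return Fraction(total, n_bins ** n_balls)
```

Under these assumptions the result should equal `Fraction(n_balls * (n_balls - 1), 2 * n_bins)` for every small $N$ and $M$ tried.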

paw88789
  • 41,907
  • 3
    This is wrong. If there were already collisions, then there are fewer spots to collide with. You need a recurrent definition and may have to do this with recursion – Charlie Tian Jan 30 '18 at 08:45
  • The recursive relationship I derived was $E(k, M) = \frac{M-1}{M}E(k-1, M) + \frac{k-1}{M}$ – Charlie Tian Jan 30 '18 at 08:59
  • My bad, I had read the question wrong. My way is the recursive DP way to do it when collision is just 1. This is for general k – Charlie Tian Jan 30 '18 at 09:07
  • I'm not able to digest the argument that the probability of collision at the kth insertion is (k-1)/M. E.g. take the case of the 3rd insertion. The probability of collision in the 3rd insertion has two cases: in case 1 the 1st and 2nd insertions collided, and in case 2 they did not. If we are in case 1, the probability of collision for the 3rd insertion is 1/M. If we are in case 2, the probability of collision in the 3rd insertion is 2/M (since the 1st and 2nd balls are in different bins). Now cases 1 and 2 are mutually exclusive, therefore the total probability of collision in the 3rd insertion should be 3/M. Please clarify. – CKM Mar 02 '21 at 05:28
  • @CKM We're looking at the expected number of new collisions here, not the probability of a new collision. If we are in your case 1, if we have a collision, we add 2 to the number of collisions since there were already 2 things in the bin. This is in fact what the OP was asking about. – paw88789 Mar 02 '21 at 10:23
0

You are claiming that $P(A \cap B)=P(A)P(B)$, which is true iff the events are independent, and they are not. Consider the insertion of the 3rd element: the "probability of collision in this insertion" is $\frac{1}{m}$ if there was a collision on the 2nd insertion (the 2nd ball ended up in the same bin as the 1st), but it is $\frac{2}{m}$ if there was not. So $P_n$ depends on $P_1, \ldots ,P_{n-1}$.

Kazz
  • 352
-1

Assuming @cats's statement that "Collisions probably means the number of times you put a ball into an already occupied bin":

You can approach this by describing the steps of the process and thinking about the graph of states.

For example: suppose we have counted $c$ collisions so far, $i$ balls are left to place, and $k$ of the $M$ bins are still empty. Then with probability $p=\frac{k}{M}$ the next ball lands in an empty bin and we move to the state "$i-1$ balls; $k-1$ empty bins; $c$ collisions", and with probability $1-p=\frac{M-k}{M}$ it lands in an occupied bin and we move to the state "$i-1$ balls; $k$ empty bins; $c+1$ collisions".

In short:

Let $S(c,i,k)$ be the state described above. The state graph looks like this:

$S(0,N,M)$ - start state

$S(c,0,k)$ - end states (for any $c,k$)

and the edges are:

$S(c,i,k)\begin{matrix}\overset{\frac{k}{M}}{\rightarrow} S(c,i-1,k-1)\\ \overset{\frac{M-k}{M}}{\rightarrow}S(c+1,i-1,k)\end{matrix} \text{ for } i>0$

Now let's define $p_{cik}$ as the probability of reaching state $S(c,i,k)$ and the variable $\Bbb X$ as the number of collisions.

If we ask "from what state can I move into state $S(c,i-1,k-1)$?", the answer is "we were in $S(c,i,k)$ and hit an empty bin, or we were in $S(c-1,i,k-1)$ and made a collision".

That gives us a recursive function: $$p_{c\,i-1\,k-1}=\frac{k}{M}\,p_{c\,i\,k}+\frac{M-k+1}{M}\,p_{c-1\,i\,k-1}$$ Additionally, we can say: $$p_{cik}=0 \text{ for } \begin{matrix}i \notin \{0,\dots,N\}\\ k \notin \{0,\dots,M\}\end{matrix}$$ since those states are not possible, and that $$p_{0NM}=1$$

That is enough to calculate $p_{cik}$ for $(i,k) \in \{0,\dots,N\} \times \{0,\dots,M\}$.

Now we can say $$P(\Bbb X=x)=\sum_{k=0}^{M} p_{x\,0\,k}$$ that is, we sum over the end states (those with $i=0$ balls left) in which exactly $c=x$ collisions were counted.

Such a model could be implemented in any programming language.
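A minimal Python sketch of such a state model follows. It assumes that, with $k$ of the $M$ bins empty, a ball lands in an empty bin with probability $\frac{k}{M}$ and otherwise causes one collision; the function name and the dictionary-based state table are illustrative choices, not part of the derivation above.

```python
from collections import defaultdict
from fractions import Fraction


def collision_distribution(n_balls, n_bins):
    """Return {x: P(X = x)}, X = number of balls placed into an occupied bin."""
    # states[(c, k)] = probability of c collisions so far with k bins still empty.
    # The number of balls left is implied by the loop iteration.
    states = {(0, n_bins): Fraction(1)}
    for _ in range(n_balls):
        next_states = defaultdict(Fraction)
        for (c, k), prob in states.items():
            if k > 0:  # ball hits an empty bin (assumed probability k / M)
                next_states[(c, k - 1)] += prob * Fraction(k, n_bins)
            if k < n_bins:  # ball hits an occupied bin: one more collision
                next_states[(c + 1, k)] += prob * Fraction(n_bins - k, n_bins)
        states = next_states
    # Sum over end states that share the same collision count.
    distribution = defaultdict(Fraction)
    for (c, _k), prob in states.items():
        distribution[c] += prob
    return dict(distribution)
```

The expectation can then be read off as `sum(x * p for x, p in collision_distribution(N, M).items())`. Under the stated assumption it should agree with $N - M\left(1-\left(\frac{M-1}{M}\right)^N\right)$.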

Kazz
  • 352