
I can't understand why the Java HashMap default load factor is 0.75. If I understand correctly, the load factor is n/m, where n is the number of keys and m is the number of positions in the hash table.

Since HashMap uses buckets (i.e., linked lists) to store values, it is possible to have a load factor greater than 1 without any problem, so it's not clear to me why the load factor is set to 0.75.

Bender

2 Answers


I don't know the answer, but I can walk you through what might be going through the mind of someone designing such a data structure.

Assuming a "good" hash function, and that $n$ is large enough, then for a given load factor $\lambda$, the probability that a bucket contains exactly $k$ elements in it is given by the Poisson distribution:

$$P(k,\lambda) = \frac{e^{-\lambda} \lambda^k}{k!}$$

So, for example, if $\lambda=1$, then the proportion of buckets containing exactly one element (i.e. those for which no chaining is needed; I'm not sure exactly how Java's HashMap actually implements it) is:

$$P(1,1) = \frac{1}{e} \approx 36.8\%$$

For $\lambda = 0.75$, it is:

$$P(1,0.75) \approx 35.4\%$$

Not much difference. However, this is measured per bucket, not per item. The proportion of successful queries that do not require traversing a chain (and are therefore loop-free) is $P(1,\lambda)/\lambda$. For $\lambda=1$, that is $36.8\%$, but for $\lambda=0.75$, it is $47.2\%$.

So by reducing $\lambda$ from $1$ to $0.75$, the proportion of successful queries that do not need to traverse a chain increases from a little over a third to a little under a half.
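If you want to check these numbers yourself, here is a minimal Java sketch (the class and method names are mine, not anything from the JDK) that evaluates the Poisson formula for both load factors:

```java
// Quick sanity check of the numbers above (not Java's implementation; the
// class and method names are invented for this example).
public class LoadFactorCheck {
    // Poisson probability that a bucket holds exactly k entries at load factor lambda.
    static double poisson(int k, double lambda) {
        double p = Math.exp(-lambda);
        for (int i = 1; i <= k; i++) {
            p *= lambda / i;   // builds up lambda^k / k! incrementally
        }
        return p;
    }

    public static void main(String[] args) {
        for (double lambda : new double[] {1.0, 0.75}) {
            double singleBuckets = poisson(1, lambda);       // fraction of buckets with exactly one entry
            double chainFreeHits = singleBuckets / lambda;   // fraction of items that sit alone in a bucket
            System.out.printf("lambda=%.2f  P(1,lambda)=%.3f  P(1,lambda)/lambda=%.3f%n",
                              lambda, singleBuckets, chainFreeHits);
        }
    }
}
```

Running it prints roughly $0.368$ and $0.368$ for $\lambda=1$, and $0.354$ and $0.472$ for $\lambda=0.75$, matching the percentages above.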

But it might use more memory. How much more?

I am going to assume some kind of optimised representation where the hash table is an array with one word per bucket, an empty bucket is represented as a null pointer, a bucket with a single element is represented as a pointer to the object, and a bucket with more than one element is represented as a pointer to a linked list. I will assume that a linked list node is 3 words in size: the object header (which Java semantics requires), a pointer to the item, and a "next" pointer.

So you can think of the overhead cost of an empty bucket as being $1$ word, a bucket with one item in it being $1$ word, and a bucket with $k>1$ items in it being $1+3k$ words.

I'll let you work out the details for yourself, but if $\lambda=1$, I calculate the overhead to be about $2.896$ words per stored item, and for $\lambda=0.75$, it's about $2.916$ words per stored item. That's less than a $0.7\%$ difference in memory overhead.
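If you'd rather not do the algebra, here is a short sketch that sums the Poisson series under exactly that cost model (again, the class and method names are mine, and the sum is truncated where the tail is negligible):

```java
// Sketch of the overhead estimate under exactly the cost model above:
// 1 word for an empty or single-entry bucket, 1 + 3k words for a bucket
// holding k > 1 entries. Names are invented for this example.
public class OverheadEstimate {
    static double poisson(int k, double lambda) {
        double p = Math.exp(-lambda);
        for (int i = 1; i <= k; i++) p *= lambda / i;
        return p;
    }

    // Expected overhead, in words per stored item, at a given load factor.
    static double overheadPerItem(double lambda) {
        double perBucket = 0.0;
        for (int k = 0; k <= 64; k++) {                       // Poisson tail beyond k=64 is negligible here
            double cost = (k <= 1) ? 1.0 : 1.0 + 3.0 * k;
            perBucket += poisson(k, lambda) * cost;
        }
        return perBucket / lambda;                            // lambda items per bucket on average
    }

    public static void main(String[] args) {
        System.out.printf("lambda=1.00: %.3f words/item%n", overheadPerItem(1.0));   // about 2.896
        System.out.printf("lambda=0.75: %.3f words/item%n", overheadPerItem(0.75));  // about 2.916
    }
}
```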

(Exercise: Under these assumptions, for what value of $\lambda$ is the memory overhead minimised? I will say that it's greater than $\frac{1}{2}$ and less than $1$. It's not $0.75$; if it were, that would be part of my answer!)

So it does seem that under these assumptions, you would get significantly more efficient lookup and only pay a tiny increase in memory overhead. And that's just one of many tradeoffs that you could look at.

My guess is that the Java library implementers performed some back-of-the-envelope calculations similar to this, and some experiments to be certain, and found that this was acceptable.

Pseudonym

Using linked-list buckets is not the only way to resolve collisions. Another technique is open addressing (sometimes described as rehashing): when a collision occurs, an attempt is made to find a different, unoccupied slot in the table itself. One way is linear probing - the index of the slot is incremented by a fixed constant (modulo the table size) until an open slot is found. A better way is quadratic probing - a counter i is incremented, and the slot at offset i² from the original index (modulo the table size) is checked, until an open slot is found. These tactics are very effective, at least until the table fills up, which is why the load factor determines when the table has to be enlarged so that probing continues to be effective. Possibly some Java hash class uses this technique.
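For illustration, here is a toy open-addressing table with linear probing (the comment in `slot` notes the one-line change that would make it quadratic). This is not how java.util.HashMap works, and every name in it is invented for the example; I believe java.util.IdentityHashMap is one JDK class that does use linear probing.

```java
// Toy open-addressing hash table with linear probing, for illustration only.
// It never stores more than 3/4 of its capacity, so a free slot always exists.
public class ProbingTable {
    private String[] keys;
    private String[] values;
    private int size;

    public ProbingTable(int capacity) {
        keys = new String[capacity];
        values = new String[capacity];
    }

    // Slot to examine on the probe-th attempt for this key.
    // Quadratic probing would use (home + probe * probe) % keys.length instead.
    private int slot(String key, int probe) {
        int home = (key.hashCode() & 0x7fffffff) % keys.length;
        return (home + probe) % keys.length;
    }

    public void put(String key, String value) {
        if ((size + 1) * 4 > keys.length * 3) {   // load factor 0.75 decides when to grow
            resize(keys.length * 2);
        }
        for (int probe = 0; ; probe++) {
            int i = slot(key, probe);
            if (keys[i] == null) { keys[i] = key; values[i] = value; size++; return; }
            if (keys[i].equals(key)) { values[i] = value; return; }   // overwrite existing key
        }
    }

    public String get(String key) {
        for (int probe = 0; probe < keys.length; probe++) {
            int i = slot(key, probe);
            if (keys[i] == null) return null;            // hit an empty slot: key is absent
            if (keys[i].equals(key)) return values[i];
        }
        return null;
    }

    private void resize(int newCapacity) {
        String[] oldKeys = keys, oldValues = values;
        keys = new String[newCapacity];
        values = new String[newCapacity];
        size = 0;
        for (int i = 0; i < oldKeys.length; i++) {
            if (oldKeys[i] != null) put(oldKeys[i], oldValues[i]);   // re-insert everything
        }
    }
}
```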

PMar