
I would like to use a data structure allowing fast access, either a balanced binary search tree (BST) for $O(\log n)$ access time or an open hash table for constant access time.

1) What is the exact memory usage of a BST or hash table, for storing $n$ values? For instance, if we consider a BST, are $n$ pointers sufficient for storing it? (We can neglect the size of the stored values. The only thing that interests me is the storage overhead involved by the use of a specific data structure.)

2) If the choice has to be determined by the cost in space, subject to the constraint of a fast enough access time, what is the best data structure to use?

For the space-cost criterion, I would like a precise description. I'm interested in two types of usage: static and dynamic. I ask this question in the context of a C implementation.

I'm mostly interested in values around $n=100000$.

Gilles 'SO- stop being evil'
user7060

3 Answers


When you're asking about "exact" memory usage, do consider that all of those pointers may not be necessary. To see why, consider that the number of binary trees with $n$ nodes is the Catalan number $C_n$, where:

$$C_n = \frac{1}{n+1} { 2n \choose n }$$

Using Stirling's approximation, we find:

$$\log_2 C_n = 2n - O(\log n)$$

So to represent a binary tree with $n$ nodes, it is sufficient to use two bits per node. That's a lot less than two pointers.

It's not too difficult to work out how to compress a static (i.e. non-updatable) binary search tree down to that size: do a depth-first or breadth-first search, and store a "1" for every branch node and a "0" for every leaf. (It is harder to see how to get $O(\log n)$ access time, and much harder to see how to allow updates to the tree. This is an active research area.)

Incidentally, while different balanced binary tree variants are interesting from a theoretical perspective, the consistent message from decades of experimental algorithmics is that in practice, any balancing scheme is as good as any other. The purpose of balancing a binary search tree is to avoid degenerate behaviour, no more and no less. Stepanov also noted that if he'd designed the STL today, he might consider in-memory B-trees instead of red-black trees, because they use cache more efficiently. They also use $n + o(n)$ extra pointers to store $n$ nodes, compared with $2n$ or $3n$ for most binary search trees.

As for hash tables, there is a similar analysis that you can do. If you are (say) storing $2^n$ integers drawn from the range $[0,2^m)$ in a hash table, with $2^n \ll 2^m$, then it is sufficient to use

$$\log {2^m \choose 2^n} \approx (m-n)2^n$$

bits. It is possible to achieve close to this bound using hash tables.

To give you the basic idea, consider an idealised hash table where you have $2^n$ elements stored in $2^n$ slots (i.e. load factor of one where every "chain" has length one).

If you hash an $m$-bit key into an $m$-bit hash value, then store it in a table with $2^n$ slots, then $n$ bits of the hash are implied by the position in the table, and you therefore only need to store the remaining $m-n$ bits. By using an invertible hash function (e.g. a Feistel network), you can recover the key exactly.

Of course, traditional hash tables have Poisson behaviour, so you would need to use a technique like cuckoo hashing to get close to a load factor of one with no chaining. See Backyard Cuckoo Hashing for further details.

So if space usage is a far more important factor than time (subject to time being "good enough"), it may be worth looking into this area of compressed data structures, and succinct data structures in particular.

Pseudonym

First of all, a plain BST does not guarantee $O(\log n)$ access; in the worst case, access degrades to $\Theta(n)$. What you really need is a self-balancing tree, such as an AVL tree or a red-black tree (RBT), to maintain logarithmic access.

1) For a BST you need two pointers per node (to the left and right children) plus one field for the data; that is the whole footprint, apart from one variable holding the root. For an AVL tree you may additionally store the heights of the left and right subtrees (BST plus two integers per node), or just one small integer encoding the balance factor. The per-node footprint trades off against the speed of the rebalancing operations: the less auxiliary data you store, the more steps rebalancing takes. Which auxiliary data to keep (or whether to choose AVL or RBT) is determined by the ratio of insertions to searches. If you have all the data in advance and the structure never changes afterwards, you can construct a perfectly balanced BST with $O(\log n)$ access.

A hash table with open addressing is described here: http://www.algolist.net/Data_structures/Hash_table/Open_addressing Instead of creating a linked list on collision, it searches for a free bucket, so no linked lists are needed.

2) Assuming you have everything in advance and need search only, perfect hashing gives constant-time access with no pointers at all: the structure is a plain array of your value type, with the hash function as getter. For example: http://cmph.sourceforge.net

To summarize the footprints: a BST is $2n$ pointers + $n$ values. An AVL tree is $2n$ pointers + $n$ (or $2n$) integers + $n$ values. An RBT is $2n$ pointers + $n$ colors (booleans) + $n$ values. If you decide on a hash table with open addressing, you have $m$ buckets (where $m > n$, and bigger is better) and $n$ values; open addressing does not degrade into linked lists but moves data to the next empty cell. Assuming you create the structures dynamically, an empty child is just a null pointer, as are the leaves in the trees — but even a leaf node still stores its (null) child fields.

Evil

As far as I have understood your question, you are asking just about the size requirement for both algorithms.

  1. If you are using a BST (binary search tree) stored implicitly in an array, the required size depends on the shape of the tree. A complete, balanced tree with $n$ elements fits in an array of about $n$ slots, but an unbalanced or skewed tree wastes most of the array: in the worst case the implicit layout needs up to $2^n - 1$ slots. With an explicit pointer-based representation, the cost is simply $n$ nodes of two pointers each, regardless of shape.
  2. If you are using a hash table, there is a factor called the load factor. If you know the maximum number of elements $n$ in advance, you can make the array exactly $n$ slots and get maximum space efficiency at a load factor of 1.0. If you don't know the number of elements (it may grow later), you can initialize with a load factor of 0.75; the final size is then about $n / 0.75 \approx 1.33n$ slots.

One benefit of a hash table is that you can fill every available location, which is not the case with a BST. In a BST, an element has exactly one appropriate place, whereas in a hash table with open addressing you can insert the element into the next free slot.

For hash tables, you can refer to this question: Choosing a suitable table size for a Hash