4

I want to use a succinct data structure to count the occurrences of a letter $C$ in a string $S$ up to a given index $I$.

Assume we have a string $S$ of length $n$ over the alphabet $\Sigma = \{1,2,3,\ldots,n\}$. I want to build a data structure whose space usage is at most $O(n)$ machine-size words.

Given $C$ (a valid letter from $\Sigma$) and an index $i$ such that $0 \le i \lt n$, the data structure should return the number of occurrences of $C$ in $S[1..i]$ as efficiently as possible (but not necessarily in $O(1)$ time). This is the only query that needs to be supported.

The data structure is aware of $S$ when created.

What I've tried so far is to split the string $S$ into blocks and to store, for each block, the number of occurrences of each letter in the previous block. I couldn't find a way to make this lighter so that the space usage meets the requirement.

D.W.
Ori Refael

2 Answers

5

To put it another way, you want to implement rank queries on an arbitrary string with an arbitrary alphabet.

If $n$ is of modest size, the usual approach is to use a wavelet tree, associating a succinct binary rank index with each node in the tree. The shape of the tree is arbitrary, but using a Huffman tree gives you a data structure whose size is very close to the zero-order entropy of the string.

When the alphabet is very large, though, the size and implementation of the tree itself become a concern. A fix that seems to work well in practice is to store $\left\lceil \log_2 n \right\rceil$ rank-select indexes of length $|S|$, but use an index implementation that is sensitive to local variation in density, such as esp or RRR.

This is covered in Claude & Navarro's 2008 paper, Practical Rank/Select Queries over Arbitrary Sequences. There may be some more recent work than this if you chase citations, but that's a good start.
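
To make the wavelet tree idea above concrete, here is a minimal Python sketch of rank on a balanced (not Huffman-shaped) wavelet tree. It is an illustration under assumptions, not the construction from the paper: the class name `WaveletTree` and its methods are hypothetical, symbols are assumed to be integers, and each node stores a plain prefix-count list instead of a succinct bit vector, so the space is $O(n \log n)$ words rather than succinct. `rank(c, i)` counts occurrences of `c` among the first `i` symbols, i.e. $S[1..i]$.

```python
class WaveletTree:
    """Balanced wavelet tree over integer symbols in the range [lo, hi]."""

    def __init__(self, seq, lo=None, hi=None):
        if lo is None:
            lo, hi = (min(seq), max(seq)) if seq else (0, 0)
        self.lo, self.hi = lo, hi
        self.size = len(seq)
        self.left = self.right = None
        if lo == hi or not seq:
            return  # leaf, or empty node: nothing more to store
        mid = (lo + hi) // 2
        # prefix[k] = how many of the first k symbols go to the left child.
        # A real implementation would use a bit vector with a succinct
        # rank index here instead of a list of integers.
        self.prefix = [0]
        for c in seq:
            self.prefix.append(self.prefix[-1] + (c <= mid))
        self.left = WaveletTree([c for c in seq if c <= mid], lo, mid)
        self.right = WaveletTree([c for c in seq if c > mid], mid + 1, hi)

    def rank(self, c, i):
        """Number of occurrences of c among the first i symbols."""
        if not (self.lo <= c <= self.hi):
            return 0
        i = max(0, min(i, self.size))
        node = self
        while node.lo != node.hi:
            if node.size == 0:
                return 0
            mid = (node.lo + node.hi) // 2
            to_left = node.prefix[i]
            if c <= mid:
                node, i = node.left, to_left       # follow c into the left child
            else:
                node, i = node.right, i - to_left  # follow c into the right child
        # At a leaf every symbol equals c, so the mapped index is the count.
        return i


S = [3, 1, 3, 2, 3, 1]
wt = WaveletTree(S)
print(wt.rank(3, 5))  # occurrences of 3 in the first five symbols
```

Each query walks one root-to-leaf path, so it performs one prefix lookup per level of the tree.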

Pseudonym
3

Let's try a few special cases:

  • If the letters in the string are all distinct, then a permuted index would give $O(1)$ lookup with $n$ words of storage.
    • Data: an array $A$ where $A[c]$ is the position of the letter $c$ in the string.
    • Data size: Array of size $n$.
    • Query: $Q(C,I) = 1$ if $A[C] \le I$ and $Q(C,I) = 0$ otherwise.
    • Query time: $O(1)$
  • If the string only contains $O(1)$ distinct letters, then you could store the set of positions of each letter independently.
    • Data: for each letter $C$, a binary search tree of the positions of $C$ in $S$, where each node stores the weight of the subtree.
    • Data size: a forest of trees with a total of $n$ nodes.
    • Query: $Q(C,I)$ is the number of positions $\le I$ in the positions tree for the letter $C$, obtained by summing the weights of the subtrees to the left of the search path for $I$.
    • Query time: $O(\lg(n))$ (binary tree lookup)

That last data structure does in fact generalize to the case where the string can contain an arbitrary arrangement of letters. There are now $n$ trees, each of which could have a size up to $n$, but the total size of the forest is still only $n$, so the storage requirement is $\Theta(n)$. The query time is $O(\lg(n))$. The time it takes to set up the data structure is $O(n \lg^k(n))$ for some $k$ that I can't be bothered to calculate.
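
A minimal Python sketch of this per-letter structure, with one substitution: because the string is fixed once the structure is built, each weighted binary search tree is replaced here by a sorted list of positions, and the "count of positions $\le I$" is found by binary search. The function names `build_positions` and `rank` are hypothetical, and positions are 1-based to match $S[1..i]$.

```python
from bisect import bisect_right
from collections import defaultdict


def build_positions(s):
    """Map each letter to the increasing list of its 1-based positions in s."""
    positions = defaultdict(list)
    for idx, c in enumerate(s, start=1):
        positions[c].append(idx)  # idx only grows, so each list stays sorted
    return positions


def rank(positions, c, i):
    """Number of occurrences of c in s[1..i]."""
    # bisect_right returns how many stored positions are <= i.
    return bisect_right(positions.get(c, []), i)


pos = build_positions("abracadabra")
print(rank(pos, "a", 4))  # occurrences of 'a' in "abra"
```

The lists hold $n$ positions in total, and each query is a single binary search, matching the $O(\lg(n))$ bound above. In this static variant the build is one left-to-right pass over the string; a tree would only be needed if the string had to support updates.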

Gilles 'SO- stop being evil'