3

Let $S$ be a set (say positive integers $\leq$ N) and $f$ an involution ($f$ is bijective, $f\cdot f=id$, e.g. xor with a constant). $g$ is a idempotent mapping choosing an arbitrary representative element in each $f$ mapped pairs. For example $g(x)=\min (x,f(x))$ $$g: S \rightarrow \tilde S \subseteq S$$

I want to build a compact lookup table from $g$'s codomain $\tilde S$ to any problem data, taking $|\tilde S| \leq |S|$ cells in memory. Ideally I wish to construct a bijective mapping $\tilde S \rightarrow \left[0, |\tilde S| \right[$.

Can it be done efficiently in general (without hash map or scanning) ? What properties of the involution $f$ could help with that ?

Edit: I formulated the problem in the more general/formalized, hoping for a generic solution. Following D.W.'s comment, I'll give a concrete application:

I work with DNA words of $k$ nucleotides bases called $k$-mers. Since there is four bases, $k$-mer are represented as elements of $S=[0,2^{2k}[$

However DNA can be read on both strands, with opposite orientations and complementary bases ($A \leftrightarrow T$, $G \leftrightarrow C$). Going from one strand to the other can be represented by this involution (reverse-complement, here for 5-mers): $$ f(x) = \text{reverse}_2(x) \oplus 0\text b1010101010 $$ where $\text{reverse}_2(abcdefghij)=ijghefcdab$ inverse the order of 2-bits blocks and $\oplus$ is the bitwise XOR.

Since many applications don't distinguish between a $k$-mers and its reverse-complement, a canonical $k$-mer is picked with $g(x)=\min (x,f(x))$. The cardinality of $g$'s co-domain is: $$ \left|\tilde{S}\right|=\begin{cases} 2^{2k-1} & \text{if }k\text{ is odd}\\ 2^{k-1}\left(2^{k}+1\right) & \text{if }k\text{ is even} \end{cases} $$

In practice saving less than one addressing bit is not worth a complex solution. But cache locality is a good thing to have. $g$ can be chosen differently if it help with that.

Piezoid
  • 133
  • 4

1 Answers1

3

I can give two candidate solutions for your specific situation.

Approach #1: parity

This only works if $k$ is odd. Notice that if $k$ is odd, the parity of $f(x)$ is the reverse of the parity of $x$. In other words, the xor of the bits of $f(x)$ is the complement of the xor of the bits of $x$.

This suggests an encoding. Define $g(x)$ to choose between $x$ and $f(x)$ by always choosing the one with even parity. In other words, $g(x)=x$ if the the bits of $x$ xor to zero, otherwise $g(x)=f(x)$. Now we have $y \in \{0,1\}^{2k}$ with even parity, and we want to map it to a $2k-1$ bit index/offset. We can do that by simply truncating the last bit of $y$ (since we know $y$ has even parity, the last bit of $y$ is uniquely determined by all the other bits).

This scheme is simple, easy to implement, efficient, and achieves perfect compression of your table. Unfortunately this doesn't work if $k$ is even. I don't know if there is a simple generalization of this idea to the case where $k$ is even.

Approach #2: multiple tables

This scheme works for arbitrary $k$. Instead of having a single lookup table, we will have $k/2$ lookup tables $T_1,T_2,\dots,T_{k/2}$, each of which stores a different subset of the space of $k$-mers. The mapping will have the property that $x$ and $f(x)$ map to the same table and same offset/index within the table.

Define $h(x)$ to be the smallest $i$ such that $f(x_i x_{k+1-i}) \ne x_i x_{k+1-i}$ (here by $x_i$ I mean the $i$th base in $x$, where $i \in \{1,2,\dots,k\}$; so $x_i$ is two bits). Now we will store $x$ in the table $T_j$ where $j=h(x)$. For instance, if $h(x)=1$, we'll store $x$ in $T_1$; if $h(x)=2$, we'll store $x$ in $T_2$; and so on. Conveniently, we have $h(x)=h(f(x))$. So, we'll define $g(x)$ to compute $j=h(x)$ and then choose between $x$ and $f(x)$ by choosing the value $y$ such that $y_j < f(y_j)$. In other words, if $x_{h(x)}$ is lexicographically smaller than $f(x)_{h(x)}$, then $g(x)=x$, otherwise $g(x)=f(x)$.

How do we determine the index into the table? Suppose $h(x)=1$. Then there are 12 possibilities for $x_1,x_k$ such that $h(x)=1$ (the other 4 possibilities yield $h(x)>1$), and our definition of $g$ picks out 6 of them as canonical. So, in total there are $6 \times 2^{2k-4}$ possible values of $x$. Thus $x$ can be easily converted into an index into $T_1$ by mapping $x_1,x_k$ to $\{0,1,2,3,4,5\}$, say $i_1$, and mapping $x_2,\dots,x_{k-1}$ to an integer in the range $0..2^{2k-4}-1$, say $i_2$, and then using $2^{2k-4} i_1 + i_2$ as our index. You can generalize to the cases $h(x)=2$, $h(x)=3$, etc. For example, suppose $h(x)=2$. Then there are 4 possibilities for $x_1,x_k$; 12 possibilities for $x_2,x_{k-1}$, but $g$ picks out 6 of them as canonical; and $2^{2k-8}$ possibilities for $x_3,\dots,x_{k-2}$. So you can map $x_1,x_k$ to an integer in the range $0..3$, map $x_2,x_{k-1}$ to an integer in the range $0..5$, and map $x_3,\dots,x_{k-2}$ to an integer in the range $0..2^{2k-8}-1$; then map those to an index in the range $0..4 \times 6 \times 2^{2k-8}-1$. And so on for larger values of $h(x)$.

This strategy works for any value of $k$. It achieves perfect compression. It's not too difficult to implement and should be pretty fast.

D.W.
  • 167,959
  • 22
  • 232
  • 500