
I have a large (10 million+) set $X$ of data points in some high-dimensional space $\mathbb{R}^d$ ($d \geq 500$). Each data point is quite sparse, e.g. it has only around $10$ non-missing components. Every missing component can be seen as having value $-\infty$ for the purposes of this problem. Associated with each data point $X_i$ is a price $y_i$.

As a quick refresher, a Pareto order is a partial order that orders vectors by those that are 'strictly better', component-wise. That is, for $a, b \in \mathbb{R}^d$ we have

$$a \preceq b \iff \forall i : a_i \leq b_i.$$

In this case we also say that $b$ dominates $a$ (though not strictly: $a$ and $b$ may be equal). Finally, note that this is very much a partial order: it very often happens that $a$ and $b$ are incomparable.

Let $D(z) = \{x \in X : z \preceq x\}$ be the elements in our dataset that dominate $z$ (taking into account only the data points, not their associated prices). I wish to know if there exists an efficient data structure that can answer two things about some $z \in \mathbb{R}^d$ (even sparser, e.g. around $4$ to $5$ components):

  1. the number of items that dominate $z$, or $|D(z)|$, and
  2. the cheapest $k$ prices among those that dominate $z$, or the smallest $k$ elements of $\{y_i : X_i \in D(z)\}$.

Note that $k$ is small here (e.g. $10$) even though $D(z)$ might contain thousands of elements.
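For reference, the naive baseline is a linear scan; here is a minimal sketch, assuming each sparse point is stored as a dict mapping component index to value (the helper names are my own):

```python
import heapq

NEG_INF = float("-inf")  # missing components behave like -infinity

def naive_query(points, prices, z, k=10):
    """Linear scan: returns (|D(z)|, the k cheapest prices among D(z))."""
    dominators = [
        i for i, x in enumerate(points)
        if all(x.get(j, NEG_INF) >= v for j, v in z.items())
    ]
    return len(dominators), heapq.nsmallest(k, (prices[i] for i in dominators))
```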


Does a data structure solving this problem efficiently exist? Every approach I can think of suffers badly from the curse of dimensionality.

orlp

2 Answers


Here is one approach you could consider. If the number of non-missing coordinates is tightly concentrated around 10, it might help you partly avoid the curse of dimensionality. I don't know whether it will be useful in practice.

Choose a random hash function $h:\{1,\dots,d\} \to \{1,\dots,10\}$. If $x \in \mathbb{R}^d$ is a data point, let $f(x)$ be its signature, where $f:\mathbb{R}^d \to \mathbb{R}^{10}$ is defined as $$f(x) = (x^*_1,\dots,x^*_{10})$$ where $x^*_j = \max \{x_i \mid h(i)=j\}$.

Notice that the signature is dense, i.e., $f$ maps a sparse high-dimensional vector to a dense low-dimensional vector.

Also, notice that $f$ is monotonic: if $z \preceq x$ then $f(z) \preceq f(x)$. The converse does not necessarily hold.
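As an illustration, here is a minimal sketch of the signature map, assuming sparse points are stored as dicts mapping coordinate index to value (the helper names are my own):

```python
import random

NEG_INF = float("-inf")  # missing components are treated as -infinity

def make_hash(d, d_prime=10, seed=0):
    """A random hash h : {0, ..., d-1} -> {0, ..., d_prime-1}."""
    rng = random.Random(seed)
    return [rng.randrange(d_prime) for _ in range(d)]

def signature(x, h, d_prime=10):
    """f(x): the dense d'-dimensional vector of bucket-wise maxima."""
    sig = [NEG_INF] * d_prime
    for i, value in x.items():      # only the non-missing components of x
        j = h[i]
        if value > sig[j]:
            sig[j] = value
    return sig
```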

The approach will be to build a data structure that, given a query $z$, helps us enumerate all $x$ such that $f(z) \preceq f(x)$; then we will check each such $x$ to see whether $z \preceq x$, and either count the ones that do or output the lowest-priced ones that do. This reduces the problem from a 500-dimensional problem (on sparse data points) to a 10-dimensional problem (on dense data points).

What does the 10-dimensional data structure look like? We can use a simple trie where level $i$ branches on the value of $x^*_i$ and each leaf stores one data point. In practice, I suggest organizing the children at level $i$ in a binary search tree keyed on $x^*_i$, rather than in a list.
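A minimal sketch of the construction, assuming the `signature` helper above; here plain Python dicts stand in for the suggested binary search trees, and points sharing a signature are flattened into one leaf list rather than one leaf per point:

```python
def build_trie(signatures):
    """Build a d'-level trie over a list of signature vectors."""
    root = {}
    for idx, sig in enumerate(signatures):
        node = root
        for value in sig[:-1]:                    # branch on x*_1, ..., x*_{d'-1}
            node = node.setdefault(value, {})
        node.setdefault(sig[-1], []).append(idx)  # leaf level: point indices
    return root
```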

Now, the lookup algorithm simply traverses the trie recursively, with the traversal pruned in the obvious way: at each level $i$ we only explore the children whose key $x^*_i$ satisfies $z^*_i \le x^*_i$ (for query $z$). With the binary search tree, those children can be enumerated at each level without touching the others.
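A minimal sketch of the pruned traversal plus the final verification step, assuming the helpers above (with plain dicts we skip the pruned children by hand; a binary search tree would let us enumerate only the keys $\ge z^*_i$ directly):

```python
import heapq

def candidates(node, z_sig, level=0):
    """Yield indices of points x with f(z) <= f(x) componentwise."""
    last = (level == len(z_sig) - 1)
    for value, child in node.items():
        if value < z_sig[level]:
            continue                   # pruned: cannot dominate at this level
        if last:
            yield from child           # child is a list of point indices
        else:
            yield from candidates(child, z_sig, level + 1)

def query(trie, h, points, prices, z, k=10, d_prime=10):
    """Answer both queries: |D(z)| and the k cheapest prices in D(z)."""
    z_sig = signature(z, h, d_prime)
    hits = [i for i in candidates(trie, z_sig)
            if all(points[i].get(j, NEG_INF) >= v for j, v in z.items())]
    return len(hits), heapq.nsmallest(k, (prices[i] for i in hits))
```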

What is the running time of this algorithm? The worst case could be bad, so I'll analyze the average-case running time via an extremely crude heuristic. If $z,x$ are two randomly chosen data vectors, then crudely $\Pr[z^*_i \le x^*_i] \approx 1/2$ for each $i$: we have two randomly chosen numbers from $\mathbb{R}$, and it is roughly equally likely which one is larger. Therefore, we can expect only about a $1/2^{10}$ fraction of the data points $x$ to satisfy $f(z) \preceq f(x)$, and the running time of the recursive traversal of the trie is roughly proportional to the number of such data points. This heuristic therefore predicts an average-case running time, on a random query $z$, of something like $O(|X|/2^{10})$: roughly a 1000-fold speedup over the naive algorithm of scanning every data point in your dataset. This crude analysis is overly optimistic, and the true running time is probably worse: it implicitly assumes that both $z$ and $x$ have exactly 10 components and that the hash function has no collisions, and in practice neither will always be true.
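To make the heuristic concrete: with $|X| = 10^7$ data points and $d' = 10$ signature coordinates, the predicted number of candidates to verify per query is about
$$\frac{|X|}{2^{d'}} = \frac{10^7}{2^{10}} \approx 10^4,$$
i.e., on the order of ten thousand verification checks instead of ten million.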


P.S. There are multiple possible variants. We could replace 10 by an arbitrary $d'$ and optimize over $d'$. Alternatively, we could define $f$ by $$f(x) = (x^*_0,x^*_1,\dots,x^*_{d'})$$ where $x^*_0$ is the number of coordinates among $x^*_1,\dots,x^*_{d'}$ that are not $-\infty$. I don't know whether either of these would be better, but they are variants you could try on your data set.

Another possible optimization is to precompute a dozen copies of the data structure, one per hash function. Then, to answer a query $z$, check which hash function maximizes the number of coordinates of $f(z)$ that are not $-\infty$, and use the corresponding data structure for the lookup. If $x$ has exactly 10 non-missing coordinates, it is very likely that one of these hash functions yields a signature with only 0, 1, or 2 coordinates at $-\infty$. You could also consider making $d'$ larger, say $d'=20$; then there is a good chance that some hash function introduces no collisions at all, though I'm not sure what effect increasing the dimension like that would have on the running time.
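A minimal sketch of this variant, assuming the helpers above; the dozen structures and the "fewest $-\infty$ coordinates" criterion follow the description, everything else is illustrative:

```python
def build_all(points, d, hash_count=12, d_prime=10):
    """Precompute one trie per hash function."""
    structures = []
    for seed in range(hash_count):
        h = make_hash(d, d_prime, seed)
        trie = build_trie([signature(x, h, d_prime) for x in points])
        structures.append((h, trie))
    return structures

def pick_structure(structures, z, d_prime=10):
    """Pick the hash whose signature of z has the fewest -inf coordinates."""
    def non_missing(h):
        return sum(v != NEG_INF for v in signature(z, h, d_prime))
    return max(structures, key=lambda h_trie: non_missing(h_trie[0]))
```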

D.W.

Your problem can also be stated as the counting version of the orthogonal range searching problem, which is well studied in computational geometry. Let us see how.

Let $m_{i}$ denote the maximum value of the $i$-th coordinate over all points in $X$. Then the query for $|D(z)|$ can be stated as follows:

Given the axis-parallel box $R$ with bottom-left corner $z$ and top-right corner $(m_{1},m_{2}, \dotsc,m_{d})$, find the number of points in $X$ that lie inside $R$.
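To illustrate the reduction, here is a minimal brute-force sketch, assuming dense vectors (missing components stored as $-\infty$); in an actual solution the scan would be replaced by an orthogonal range-counting structure such as a range tree or $k$-d tree:

```python
NEG_INF = float("-inf")

def dominance_count(X, z):
    """|D(z)| as a range count over the box [z_1, m_1] x ... x [z_d, m_d]."""
    d = len(z)
    m = [max(x[i] for x in X) for i in range(d)]   # top-right corner of R
    # The upper bound x[i] <= m[i] holds by construction; it is written out
    # only to make the axis-parallel box explicit.
    return sum(all(z[i] <= x[i] <= m[i] for i in range(d)) for x in X)
```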


However, this does not fully answer your question, since in your scenario you have the added advantage that the vectors are sparse. This property can surely be exploited to design more efficient algorithms.

Inuyasha Yagami