
Problem: Given a set of points $S = \{x_1, x_2, x_3, \ldots, x_n\}$ from $\mathbb{R}^m$ and an offset vector $v \in \mathbb{R}^m$, find a set $Z \subseteq S \times S$ of $k$ pairs of points $(x_i, x_j)$ such that $\|x_i-x_j-v\|$ (Euclidean norm) is smaller for every pair $(x_i,x_j) \in Z$ than for any pair not in $Z$.

One obvious approach is to maintain a max-heap of size $k$ and run through all pairs of points $(x_i, x_j) \in S^2$, inserting a pair into the heap whenever $\|x_i-x_j-v\|$ is smaller than the current maximum in the heap. This algorithm has $O(mn^2 \log k)$ running time.
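For concreteness, here is a minimal sketch of that brute-force heap approach in Python, assuming the points are stored as rows of an $n \times m$ NumPy array `points` and the offset as a length-$m$ array `v` (these names, and the choice to skip $i = j$ pairs, are my own, not part of the problem statement):

```python
import heapq
import numpy as np

def k_closest_pairs_bruteforce(points, v, k):
    """Return the k ordered pairs (i, j), i != j, minimizing ||x_i - x_j - v||."""
    n = len(points)
    heap = []  # max-heap of size k, simulated by storing negated distances
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # assumption: a point is not paired with itself
            d = np.linalg.norm(points[i] - points[j] - v)  # O(m) per pair
            if len(heap) < k:
                heapq.heappush(heap, (-d, i, j))
            elif d < -heap[0][0]:  # smaller than the current k-th best distance
                heapq.heapreplace(heap, (-d, i, j))  # O(log k)
    # sort the surviving pairs from smallest to largest distance
    return [(i, j) for _, i, j in sorted(heap, key=lambda t: -t[0])]
```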

Is there a faster algorithm? Is there a lower bound on the complexity of this problem?

This problem is motivated by an application where $50 \leq m \leq 1000$, $10^5 \leq n \leq 10^6$, and $5 \leq k \leq 100$, and the running time should be in seconds, not minutes or hours (which is why the naive approach above is not applicable).

Alexandre

1 Answer


One optimization I would propose over the brute-force search:

$$ \begin{align*} d(\mathbf{x}_i, \mathbf{x}_j) &= \lVert (\mathbf{x}_i-\mathbf{x}_j) - \mathbf{v} \rVert^2\\ &= \sum\limits_{k=1}^N (x_i^k - x_j^k-v^k)^2\\ &= \sum\limits_{k=1}^N ((x_i^k-x_j^k)^2+(v^k)^2-2v^k(x_i^k-x_j^k))\\ &= \sum\limits_{k=1}^N ((x_i^k)^2+(x_j^k)^2-2x_i^kx_j^k+(v^k)^2-2v^k(x_i^k-x_j^k))\\ \end{align*} $$ Since $(v^k)^2$ is the same for all pairs, we can simply drop it; it doesn't affect the minimization.

\begin{align*} d(\mathbf{x}_i, \mathbf{x}_j) &= \sum\limits_{k=1}^N ((x_i^k)^2+(x_j^k)^2-2x_i^kx_j^k-2v^kx_i^k+2v^kx_j^k)\\ &= \sum\limits_{k=1}^N (x_i^k)^2 + \sum\limits_{k=1}^N(x_j^k)^2 - 2\sum\limits_{k=1}^N x_i^kx_j^k - 2\sum\limits_{k=1}^N v^kx_i^k + 2\sum\limits_{k=1}^N v^kx_j^k\\ \end{align*}

Let's go back to vector notation:

\begin{align*} d(\mathbf{x}_i, \mathbf{x}_j) &= \lVert \mathbf{x}_i \rVert^2+\lVert \mathbf{x}_j \rVert^2 - 2(\mathbf{x}_i \cdot \mathbf{x}_j)- 2(\mathbf{x}_i \cdot \mathbf{v}) + 2(\mathbf{x}_j \cdot \mathbf{v})\\ \end{align*}

Note that all the terms except the middle one are free of pairwise computations; they can be computed once per point in $O(N)$ time and stored. To compute $(\mathbf{x}_i \cdot \mathbf{x}_j)$, one can assemble the matrix $X$ that contains $\mathbf{x}_i^T$ in each row and compute $D=XX^T$. Each element of this huge symmetric matrix then gives you the dot product for a pair: $D(i,j)=(\mathbf{x}_i \cdot \mathbf{x}_j)$. If memory is a concern, you can simply revert to iterative computation and not store the intermediate dot products. All in all, this saves a lot of time in the pairwise computations, speeding up the entire search. I assume that you could couple this easy-to-implement approach with any other optimization to further boost the performance. In all the calculations I omitted the $\sqrt{\cdot}$ because it doesn't influence the relative comparison of distances.
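As an illustration, here is a minimal NumPy sketch of this decomposition; the array `X` of shape $(n, N)$ holding the points as rows, the offset `v`, and the function name are my own assumptions, not from the question. For the $n$ in the question ($10^5$ to $10^6$), the full $n \times n$ matrix would not fit in memory, so in practice the same computation would be run over row blocks of $X$:

```python
import numpy as np

def k_closest_pairs_gram(X, v, k):
    """Return the k ordered pairs (i, j), i != j, minimizing ||x_i - x_j - v||,
    using the decomposition above (the constant ||v||^2 and the sqrt are dropped)."""
    n = len(X)
    sq_norms = np.einsum('ij,ij->i', X, X)   # ||x_i||^2 for every point, O(n*N)
    xv = X @ v                               # x_i . v for every point, O(n*N)
    G = X @ X.T                              # all pairwise dot products, the O(n^2*N) part

    # squared offsets, up to the dropped constant ||v||^2
    D = (sq_norms[:, None] + sq_norms[None, :]
         - 2.0 * G - 2.0 * xv[:, None] + 2.0 * xv[None, :])
    np.fill_diagonal(D, np.inf)              # assumption: exclude i == j pairs

    flat = np.argpartition(D.ravel(), k)[:k]   # k smallest entries, no full sort
    flat = flat[np.argsort(D.ravel()[flat])]   # order them by distance
    return [tuple(divmod(int(idx), n)) for idx in flat]
```

The final selection uses `np.argpartition`, which finds the $k$ smallest entries without sorting the whole matrix; it plays the same role as the max-heap in the brute-force version.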

If $\mathbf{v}=\mathbf{0}$ ($\mathbf{v}$ is the zero vector), the entire procedure boils down to a fast computation of the pairwise distance matrix; this view might benefit certain applications.

Tolga Birdal