
I have two datasets $A$ and $B$:

$A = \{1, 5, -34, 34.34323, -23.5444, 9.43, \dots\}$

$B = \{-0.5, 44, -0.243, -3455, 2.22221, -23.23, \dots\}$

These datasets contain a large number of "random" values (they are not actually random, but their source does not matter for this question). Both datasets always have the same size.

Now I need to calculate two numbers $u, v \in \mathbb{R}$ so that the sum $\sum_i (a_i + b_i)$, taken over exactly those indices $i$ with $a_i \geq u$ and $b_i \geq v$, is as large as possible.

For clarification: say $a_3 \geq u$ but $b_3 < v$. Then $(a_3 + b_3)$ is not included in the sum. That is what is meant by $a_i \geq u$ and $b_i \geq v$.

Does an algorithm for this problem already exist? If not, I am open to ideas.

Bobface

2 Answers


For each pair of elements $(a_{i},b_{i})$, you can calculate the sum of all the $(a_j+b_j)$ with $a_{j} \geq a_{i}$ and $b_{j} \geq b_{i}$, and then take the candidate pair whose sum is largest. So an algorithm could look like:

largest_pair = 0
largest_value = -∞
for each i:
    total = 0                      # reset for each candidate pair (a[i], b[i])
    for each j:
        if a[j] >= a[i] and b[j] >= b[i]:
            total += a[j] + b[j]
    if total > largest_value:
        largest_value = total
        largest_pair = i
u = a[largest_pair]
v = b[largest_pair]
output u, v

This algorithm runs in $O(n^2)$ time, but it adapts easily to more than two datasets, and there should be some shortcuts that could speed it up.
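For concreteness, here is a runnable Python transcription of the pseudocode above (the function name and the toy input are my own, purely for illustration):

import math

def best_thresholds(a, b):
    """Try each data pair (a[i], b[i]) as the thresholds (u, v);
    return the best (u, v) and the sum it achieves. O(n^2)."""
    n = len(a)
    best_value = -math.inf
    best_pair = 0
    for i in range(n):
        total = 0
        for j in range(n):
            if a[j] >= a[i] and b[j] >= b[i]:
                total += a[j] + b[j]
        if total > best_value:
            best_value = total
            best_pair = i
    return a[best_pair], b[best_pair], best_value

u, v, s = best_thresholds([1, 5, -34, 34.34323], [-0.5, 44, -0.243, -3455])
print(u, v, s)   # 1 -0.5 49.5 on this toy input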

Aidan Connelly

This problem can be solved in $O(n \log^2 n)$ time. The basic approach is to use Aidan Connelly's algorithm, except that we speed up the inner loop so it takes only $O(\log n)$ time instead of $O(n)$ time. The solution combines a sweepline algorithm, persistent data structures, and a few other tricks.


Prefix sums. First, as a warm-up, let's note that we can build a data structure that stores a set $Y=\{y_1,\dots,y_n\}$ of points and supports the following operations:

  • PrefixSum($y$): given $y$, return $\sum_{y_i \ge y, y_i \in Y} y_i$.

  • Insert($y$): given $y$, add it to the set $Y$ of points stored in the data structure.

I'll show how to perform both operations in $O(\log n)$ time.

In particular, we can store the points of $Y$ in a balanced binary search tree, with each point $y_i$ in a leaf of the tree. We also store a number in each node of the tree: initially, each leaf holds the value of the point stored in it, and each internal node holds 0.

When we run Insert($y$), we insert a new leaf for the point $y$ (in the correct location, so the leaves remain in sorted order), then find a collection of disjoint subtrees whose union is exactly the set of leaves $y_i$ with $y_i < y$, and add $y$ to the root of each of those subtrees. Notice that any such consecutive interval of leaves can be covered by $O(\log n)$ disjoint subtrees, so the Insert operation runs in $O(\log n)$ time.

To answer PrefixSum($y$), we sum all the numbers on the path from the root to the leaf for $y$ and return that sum. Since the tree is balanced, this also runs in $O(\log n)$ time.
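To make this concrete, here is a minimal non-persistent sketch in Python. To keep it short, I replace the balanced BST with a segment tree built over the pre-sorted universe of keys (which we know in advance in the application below), so no rebalancing is needed; all names are mine. I also attach an explicit weight to each key, since the sweepline step below will store $a_i+b_i$ under the key $b_i$; calling insert(y, y) recovers exactly the behavior described above.

import bisect

class SuffixSumTree:
    """Segment tree over the sorted key universe; supports
    insert(key, weight) and prefix_sum(y) = total weight of
    inserted keys >= y, both in O(log n)."""

    def __init__(self, universe):
        self.keys = sorted(set(universe))    # every key we may ever insert
        self.n = len(self.keys)
        self.sums = [0.0] * (4 * self.n)     # sums[v] = total weight in v's range

    def insert(self, key, weight):
        self._add(1, 0, self.n - 1, bisect.bisect_left(self.keys, key), weight)

    def _add(self, v, lo, hi, i, w):
        self.sums[v] += w
        if lo == hi:
            return
        mid = (lo + hi) // 2
        if i <= mid:
            self._add(2 * v, lo, mid, i, w)
        else:
            self._add(2 * v + 1, mid + 1, hi, i, w)

    def prefix_sum(self, y):
        i = bisect.bisect_left(self.keys, y)  # first position with key >= y
        return self._query(1, 0, self.n - 1, i, self.n - 1) if i < self.n else 0.0

    def _query(self, v, lo, hi, ql, qr):
        if qr < lo or hi < ql:
            return 0.0
        if ql <= lo and hi <= qr:
            return self.sums[v]
        mid = (lo + hi) // 2
        return (self._query(2 * v, lo, mid, ql, qr)
                + self._query(2 * v + 1, mid + 1, hi, ql, qr))

# Example: after inserting 44 and -0.5 (weights equal to the keys),
# prefix_sum(0) sums the inserted keys >= 0:
t = SuffixSumTree([-0.5, -0.243, 44])
t.insert(44, 44)
t.insert(-0.5, -0.5)
print(t.prefix_sum(0))   # 44.0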


Persistence. Next, we can turn this structure into a (partially) persistent data structure using standard path-copying techniques.

Now, given a version number (timestamp) $t$ and a number $y$, we can compute the answer to PrefixSum($y$) for version $t$ of the data structure. This can be done in $O(\log n)$ time. Also, we can perform the operation Insert($y$) on the latest version ($t$), obtaining a new version ($t+1$), in $O(\log^2 n)$ time. After doing $n$ inserts, the final data structure will have size $O(n \log n)$.
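For the simplified static tree sketched above, partial persistence is particularly easy: an insert copies only the $O(\log n)$ nodes on a single root-to-leaf path and shares everything else, so every old root keeps answering queries for its own version unchanged. A hedged sketch (names are mine; the $O(\log^2 n)$ insert bound quoted above is for the general balanced-BST version, while this static variant pays only $O(\log n)$ per insert):

class Node:
    """Node of a persistent segment tree; old versions are never mutated."""
    __slots__ = ("left", "right", "total")
    def __init__(self, left=None, right=None, total=0.0):
        self.left, self.right, self.total = left, right, total

def p_insert(root, lo, hi, i, w):
    """Add weight w at position i; returns a NEW root, old root untouched."""
    new = Node(root.left if root else None,
               root.right if root else None,
               (root.total if root else 0.0) + w)
    if lo != hi:
        mid = (lo + hi) // 2
        if i <= mid:
            new.left = p_insert(new.left, lo, mid, i, w)
        else:
            new.right = p_insert(new.right, mid + 1, hi, i, w)
    return new

def p_sum(root, lo, hi, ql, qr):
    """Total weight in positions [ql, qr] under this version's root."""
    if root is None or qr < lo or hi < ql:
        return 0.0
    if ql <= lo and hi <= qr:
        return root.total
    mid = (lo + hi) // 2
    return (p_sum(root.left, lo, mid, ql, qr)
            + p_sum(root.right, mid + 1, hi, ql, qr))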


Sweepline. Now, let's apply this infrastructure to your problem. Assume the points $(a_i,b_i)$ are sorted by increasing $a_i$, with ties resolved by increasing $b_i$. We'll sweep leftwards, from $(a_n,b_n)$ down to $(a_1,b_1)$, i.e., in order of decreasing $a_i$; when we visit $(a_i,b_i)$, we insert the key $b_i$ with associated value $a_i+b_i$ into the persistent data structure at that time. (The data structure above generalizes immediately from summing the keys $y_i$ themselves to summing an arbitrary value attached to each key, which is what we need here.) Let's think of $a_i$ as the "version number" for the persistent data structure: given $a_i$, we can find the version of the data structure right after $(a_i,b_i)$ was processed.

Notice that given any $a_j$ and any $b_k$, we can now look up the version corresponding to $a_j$ and perform the PrefixSum($b_k$) operation. This returns the value of $\sum_{a_i \ge a_j,\, b_i \ge b_k} (a_i + b_i)$, and takes $O(\log n)$ time. So we can simply do this operation $n$ times, once for each pair $(a_j,b_j)$ that appears in the original list, and keep the best result, just as in Aidan Connelly's algorithm.

How long does this take? Sorting the data takes $O(n \log n)$ time. Building the persistent data structure takes $O(n \log^2 n)$ time, as we perform the Insert operation $n$ times and each one costs $O(\log^2 n)$. Finally, we perform $n$ PrefixSum lookups at $O(\log n)$ each, so that stage takes $O(n \log n)$ time. The total running time is therefore $O(n \log^2 n)$.
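Putting the pieces together, here is an end-to-end sketch reusing p_insert and p_sum from the persistence sketch above (all names and the toy input are mine). It sweeps in order of decreasing $a_i$, remembers the root ("version") right after each point is inserted, and then evaluates every candidate $(u,v)=(a_j,b_j)$ with a single query. With the static-universe tree the whole pipeline runs in $O(n \log n)$; the $O(n \log^2 n)$ bound above covers the general balanced-BST variant.

import bisect, math
# assumes Node, p_insert, p_sum from the previous sketch

def best_thresholds_fast(a, b):
    keys = sorted(set(b))                 # key universe for the b-coordinates
    m = len(keys)
    pts = sorted(zip(a, b))               # increasing a, ties by b
    version = {}                          # point -> root right after its insert
    root = None
    for ai, bi in reversed(pts):          # sweep by decreasing a
        i = bisect.bisect_left(keys, bi)
        root = p_insert(root, 0, m - 1, i, ai + bi)   # weight = a_i + b_i
        version[(ai, bi)] = root
    best, best_uv = -math.inf, None
    for ai, bi in pts:                    # try each candidate (u, v) = (a_i, b_i)
        i = bisect.bisect_left(keys, bi)
        total = p_sum(version[(ai, bi)], 0, m - 1, i, m - 1)
        if total > best:
            best, best_uv = total, (ai, bi)
    return best_uv, best

print(best_thresholds_fast([1, 5, -34, 34.34323], [-0.5, 44, -0.243, -3455]))
# ((1, -0.5), 49.5), matching the brute-force answer on this toy input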


General remarks. This technique of combining persistent data structures and a sweepline method is a useful one in computational geometry problems, especially when working in 2D. See also Using persistence on a constant database and What classes of data structures can be made persistent?.

D.W.