
I have a file that has a number (a positive integer) on each row.

Given a number $q$, I want to find a value that's a sum of some 8 numbers in the file, and is as close to $q$ as possible.

So, supposing that I have a file that looks like this:

345
3
2
3453
1234
6
7
34
12
1111
48
413

If the given number is 526, I would like to get a solution like 3+2+6+7+34+12+48+413 = 525, since those are the 8 numbers that best fit the condition.

Is there an efficient algorithm with low complexity?

The file can have >1000 rows and the numbers won't be bigger than 10k.

Dex

2 Answers


Dynamic programming

One approach is to use dynamic programming. If you have $n$ numbers ($n$ rows in the file), and each number is in the range $1..m$, then the obvious dynamic programming algorithm has running time about $8mn$. For your parameter size, this might be adequate.

In particular, let $A[1..n]$ be an array of your $n$ numbers. Define $T[x,i,j] = 1$ if there is a way to write $x$ as a sum of $j$ numbers from among $A[1..i]$, or 0 if not. Now we have the relation

$$T[x,i,j] = T[x,i-1,j] \lor T[x-A[i],i-1,j-1]$$

(with the convention that $T[x,i,j]$ always evaluates to 0 if $x$ is negative). Consequently, you can compute all of the $T[x,i,j]$ values iteratively. There are $8mn$ values to compute, so this takes $8mn$ time.

Once you have filled in the $T$ table, now you can answer any query quickly. Suppose you want to find the closest number to $q$ that can be written as a sum of 8 numbers. Then you should scan $T[q,n,8]$, $T[q+1,n,8]$, $T[q-1,n,8]$, $T[q+2,n,8]$, $T[q-2,n,8]$, etc., until you find the first entry that is one. This takes at most $m$ iterations.

In total, the running time is $O(mn)$, where the constant hidden by the big-O notation is small. The space required is $O(mn)$ as well, but this can be reduced to $O(m)$ if you fill in the $T$ table in the right order and discard entries once they are no longer needed.
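As a rough illustration, here is a minimal Python sketch of this dynamic program (the function name `closest_eight_sum` and the rolling-table layout are my own choices; it assumes the numbers have already been read into a list):

```python
from typing import List, Optional

def closest_eight_sum(numbers: List[int], q: int) -> Optional[int]:
    """DP sketched above: reachable[j][x] is True when some j of the numbers
    processed so far sum to exactly x. Processing the numbers one at a time
    plays the role of the i dimension, so only O(m) space per j is kept."""
    max_sum = 8 * max(numbers)
    reachable = [[False] * (max_sum + 1) for _ in range(9)]
    reachable[0][0] = True

    for a in numbers:
        # Go downward in j so each number is used at most once:
        # reachable[j-1] still reflects the state before this number was added.
        for j in range(8, 0, -1):
            prev, cur = reachable[j - 1], reachable[j]
            for x in range(a, max_sum + 1):
                if prev[x - a]:
                    cur[x] = True

    # Query step: scan T[q], T[q+1], T[q-1], T[q+2], ... until a sum is found.
    for d in range(max(q, max_sum) + 1):
        for x in (q + d, q - d):
            if 0 <= x <= max_sum and reachable[8][x]:
                return x
    return None  # fewer than 8 numbers in the file
```

On the example from the question, `closest_eight_sum([345, 3, 2, 3453, 1234, 6, 7, 34, 12, 1111, 48, 413], 526)` returns 525.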

Two-way merge

There's an alternative algorithm that will probably be significantly faster for your specific problem. The basic idea is to first find an algorithm that works for sums of 2 numbers, then apply that repeatedly.

Let $S$ be a set of numbers. Define $S \bowtie S$ to be the set

$$S \bowtie S = \{s+t : s, t \in S\}.$$

In other words, $S \bowtie S$ is the set of all possible sums of two values from $S$. In our case, we're going to represent $S$ as a sorted list of numbers.

It turns out you can compute $S \bowtie S$ from $S$, using an FFT-based convolution procedure. Let $N$ be the smallest power of two that is larger than $2m$. Define an $N$-vector $v$ whose $i$th coordinate is 1 if $i\in S$, or 0 otherwise. Compute $v \otimes v$, the convolution of $v$ with itself. The nonzero coordinates of $v \otimes v$ are exactly the elements of $S \bowtie S$, so you can recover $S \bowtie S$ from $v \otimes v$. Also, you can compute the convolution $v \otimes v$ using a discrete FFT over the $N$th complex roots of unity: take the FFT of $v$, square every entry, then take the inverse FFT. The running time of the FFT is $O(N \lg N)$. Since $N=O(m)$, it follows that the running time of this procedure is $O(m \lg m)$.
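Here is one way this convolution step could look with NumPy's FFT routines (the helper name `self_convolve` and the 0/1 indicator layout are my own choices, not part of the answer):

```python
import numpy as np

def self_convolve(indicator: np.ndarray) -> np.ndarray:
    """Given a 0/1 vector whose i-th entry marks whether i is in S, return the
    0/1 vector marking which values lie in S ⋈ S, via FFT-based convolution.
    (As in the definition of S ⋈ S, a value may be paired with itself.)"""
    n = len(indicator)
    size = 1
    while size < 2 * n:                  # smallest power of two that holds all pairwise sums
        size *= 2
    spectrum = np.fft.rfft(indicator, size)            # FFT of v, zero-padded to length `size`
    counts = np.fft.irfft(spectrum * spectrum, size)   # inverse FFT of the squared spectrum = v ⊗ v
    return (counts > 0.5).astype(indicator.dtype)      # threshold away floating-point noise

# Example: mark which values up to m appear in the file.
m = 10_000
v = np.zeros(m + 1)
v[[345, 3, 2, 3453, 1234, 6, 7, 34, 12, 1111, 48, 413]] = 1.0
two_way = self_convolve(v)   # two_way[x] == 1 iff x is a sum of two file values
```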

Now we repeat this recursively three times. In other words, we let $S$ be the set of numbers in the original file, compute $T = S \bowtie S$ (the set of 2-way sums), then $U = T \bowtie T$ (the set of 4-way sums), and then $V = U \bowtie U$ (the set of 8-way sums). This gives us the set of all numbers that can be represented as an 8-way sum.
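Using the `self_convolve` helper sketched above, the three rounds are simply (an illustration of the recursion, with variable names matching the prose):

```python
T = self_convolve(v)   # indicator of 2-way sums
U = self_convolve(T)   # indicator of 4-way sums
V = self_convolve(U)   # indicator of 8-way sums
```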

If you like, a more elegant way to compute the set of 8-way sums is as follows: let $N$ be the smallest power of two that is larger than $8m$, form an $N$-vector $v$ where $v_i=1$ if and only if $i$ appears in the original file, take the discrete FFT of $v$ (over the $N$th complex roots of unity), raise each entry to the 8th power, then take the inverse FFT. This gives you an $N$-vector that has a nonzero entry in its $i$th coordinate if and only if $i$ is an attainable 8-way sum.
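A hedged sketch of this one-shot variant (the function name is mine; note that for very large files the intermediate counts can exceed double precision, which ties into the round-off concern mentioned at the end of this answer, so re-thresholding after each pairwise step as above can be more robust):

```python
import numpy as np

def attainable_eight_sums(numbers, m: int = 10_000) -> np.ndarray:
    """FFT the 0/1 indicator of the file, raise the spectrum to the 8th power,
    inverse FFT, and read off which coordinates are nonzero (attainable 8-way sums).
    Like the set construction above, this allows a file value to repeat within a sum."""
    size = 1
    while size <= 8 * m:                 # smallest power of two larger than 8m
        size *= 2
    v = np.zeros(size)
    v[np.asarray(numbers)] = 1.0
    spectrum = np.fft.rfft(v)
    counts = np.fft.irfft(spectrum ** 8, size)
    # Caution: for large n the counts can be huge; thresholding only at the end
    # is more sensitive to round-off than the pairwise approach above.
    return np.flatnonzero(counts > 0.5)  # sorted array of attainable 8-way sums
```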

Finally, given a query $q$, we can quickly look it up in $V$ (using binary search, say) and find the nearest number to $q$ that's in $V$.
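For that lookup, a binary search over the sorted array produced by the previous sketch could look like this (`nearest_attainable` is just an illustrative name):

```python
import numpy as np

def nearest_attainable(sums: np.ndarray, q: int) -> int:
    """Binary search the sorted array of attainable 8-way sums for the value closest to q."""
    i = int(np.searchsorted(sums, q))
    neighbours = sums[max(i - 1, 0): i + 1]   # at most the two values bracketing q
    return int(min(neighbours, key=lambda s: abs(int(s) - q)))
```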

The total running time will be $O(m \lg m)$, and the total space will be $O(m)$. For your parameter settings, this should be extremely fast, given that $m$ is no larger than 10,000. However, the implementation complexity is much higher, since you need to implement an FFT, worry about round-off error and precision for complex numbers, and so on.

D.W.

For the dynamic programming approach, I would first sort the numbers. (I write $s$ for the target value, called $q$ in the question.) Let $s_1$ be the sum of the 8 largest numbers. If $s_1 \le s$ then the solution is $s_1$.

Let $s_2$ be the sum of the 8 smallest numbers. If $s_2 \ge s$ then the solution is $s_2$. (These two steps eliminate some extreme cases, especially when $s$ is very small or very large compared to the other numbers.)

Otherwise, replace the 8th smallest number with the next larger one as long as the sum stays $\le s$. Then replace the 7th smallest number with the next larger one as long as the sum stays $\le s$ and the 7th smallest does not become the same as the 8th smallest, and so on. Let $s_3 \le s$ be the sum obtained this way. If by coincidence $s_3 = s$ then $s_3$ is the solution. Otherwise, $s_3$ may be quite close to $s$. Let $L = s_3$ and $R = s + (s - s_3)$; then the solution $x$ of the problem lies in the range $L \le x \le R \le 2s$. Having an upper bound for the solution reduces the amount of work that needs doing. And as long as the sum of the 7 smallest numbers and the largest number exceeds $R$, we can remove the largest number from the problem.

(In the example, sorted: 2 3 6 7 12 34 48 345 413 1111 1234 3453, we quickly find that the 7 smallest numbers add up to 112 and the greedy step picks 413 as the eighth number, making $s_3 = 112 + 413 = 525$. So solutions are in the range $525 \le x \le 527$, which removes all numbers greater than $527 - 112 = 415$, leaving only 9 numbers.)
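A sketch of these preprocessing steps in Python (the function name, return convention, and list layout are my own; this only prepares the bounds and the pruned list that the dynamic program would then work on, and assumes at least 8 numbers):

```python
from typing import List

def preprocess(numbers: List[int], s: int):
    """Sort, handle the two extreme cases, compute the greedy sum s3 <= s,
    derive the bounds L <= x <= R, and prune numbers that cannot appear
    in any sum <= R."""
    a = sorted(numbers)
    s1, s2 = sum(a[-8:]), sum(a[:8])
    if s1 <= s:
        return s1, None          # even the 8 largest numbers do not reach s
    if s2 >= s:
        return s2, None          # even the 8 smallest numbers overshoot s

    # Greedy step: starting from the 8 smallest, push the 8th, then the 7th, ...
    # chosen index upward while the sum stays <= s and indices do not collide.
    chosen = list(range(8))      # indices into a of the currently selected 8 numbers
    total = s2
    for pos in range(7, -1, -1):
        upper = chosen[pos + 1] if pos + 1 < 8 else len(a)
        while chosen[pos] + 1 < upper and total + a[chosen[pos] + 1] - a[chosen[pos]] <= s:
            total += a[chosen[pos] + 1] - a[chosen[pos]]
            chosen[pos] += 1
    s3 = total
    if s3 == s:
        return s3, None          # exact hit, done

    L, R = s3, 2 * s - s3        # the optimum lies in L <= x <= R
    while len(a) > 8 and sum(a[:7]) + a[-1] > R:
        a.pop()                  # the largest number cannot occur in any sum <= R
    return None, (L, R, a)       # bounds plus the pruned, sorted list for the DP
```

On the example, `preprocess([345, 3, 2, 3453, 1234, 6, 7, 34, 12, 1111, 48, 413], 526)` returns `(None, (525, 527, [2, 3, 6, 7, 12, 34, 48, 345, 413]))`.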

Then I would sort the numbers in order of their distance to $s/8$, with the ones closest to $s/8$ coming first. As a result, the dynamic programming approach will find sums of eight numbers close to $s$ right from the start. Especially if we have many numbers, this increases our chances that a sum exactly equal to $s$ is found quickly, in which case we are finished early. And whenever we find a sum between $L$ and $R$, we can improve the bounds $L \le x \le R$ for solutions.
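For instance (a trivial sketch, assuming the pruned list `numbers` and target `s` from above):

```python
# Feed the DP the numbers closest to s/8 first.
ordered = sorted(numbers, key=lambda x: abs(x - s / 8))
```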

Once all small numbers (all numbers $\le s/8$) are processed and the smallest remaining number is $X$, the dynamic programming does not need to track any sums of seven numbers that are greater than $R - X$ (since adding an eighth number would give a result greater than $R$), nor any sums of six numbers that are greater than $R - 2X$, and so on, which again decreases the amount of work needed.

If there are no numbers much larger than $s/8$ and the largest number is $X$, then we can ignore all sums of seven numbers less than $L - X$. If the sum of the two largest numbers is $X_2$, then we can ignore all sums of six numbers less than $L - X_2$, and so on.

gnasher729