
Over at this question about inversion counting, I found a paper that proves a lower bound on the space complexity of all (exact) streaming algorithms. I have claimed that this bound extends to all linear-time algorithms. This is a bit bold: in general, a linear-time algorithm can jump around at will (random access), which a streaming algorithm cannot; it has to investigate the elements in order. It may perform multiple passes, but only constantly many (to keep the runtime linear).

Therefore my question:

Can every linear-time algorithm be expressed as a streaming algorithm with constantly many passes?

Random access seems to prevent a (simple) construction proving a positive answer, but I have not been able to come up with a counterexample either.

Depending on the machine model, random access may not even be an issue, runtime-wise. I would be interested in answers for these models:

  • Turing machine, flat input
  • RAM, input as array
  • RAM, input as linked list
Raphael

4 Answers


For streaming algorithms to be meaningful, they have to work with a significantly smaller amount of work space than the input itself. For example, if you allow as much work space as the input occupies, then you can trivially state any algorithm as a “single-pass streaming algorithm”: first copy the input to the work space in a single pass, then operate only on the work space.

I think that it is typical to restrict the work space to at most polylogarithmic in the input size when talking about streaming algorithms. Under this assumption, median selection does not have an $O(1)$-pass streaming algorithm by the result of Munro and Paterson [MP80]: any $P$-pass streaming algorithm for median selection on $N$ elements has to store $\Omega(N^{1/P})$ elements. On the other hand, median selection has a well-known deterministic linear-time algorithm [BFPRT73].
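For reference, the linear-time selection algorithm of [BFPRT73] (median of medians) can be sketched in Python; this is my sketch of the standard scheme, not the paper's own presentation:

```python
def select(arr, k):
    """Return the k-th smallest element (0-indexed) of arr.

    Median-of-medians pivoting gives worst-case O(n) time, but the
    recursion freely revisits elements -- it is not a streaming pass.
    """
    if len(arr) <= 5:
        return sorted(arr)[k]
    # Median of each group of 5, then recursively pick their median as pivot.
    medians = [sorted(arr[i:i + 5])[len(arr[i:i + 5]) // 2]
               for i in range(0, len(arr), 5)]
    pivot = select(medians, len(medians) // 2)
    lo = [x for x in arr if x < pivot]
    hi = [x for x in arr if x > pivot]
    eq = len(arr) - len(lo) - len(hi)  # multiplicity of the pivot value
    if k < len(lo):
        return select(lo, k)
    elif k < len(lo) + eq:
        return pivot
    else:
        return select(hi, k - len(lo) - eq)
```

The pivot is guaranteed to land in the middle 30–70% of the data, which is what bounds the recursion depth and total work.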

[BFPRT73] Manuel Blum, Robert W. Floyd, Vaughan Pratt, Ronald L. Rivest, and Robert E. Tarjan. Time bounds for selection. Journal of Computer and System Sciences, 7(4):448–461, Aug. 1973. DOI: 10.1016/S0022-0000(73)80033-9

[MP80] J. Ian Munro and Mike S. Paterson. Selection and sorting with limited storage. Theoretical Computer Science, 12(3):315–323, Nov. 1980. DOI: 10.1016/0304-3975(80)90061-4

Tsuyoshi Ito

In the streaming model you are only allowed to store a constant or polylogarithmic amount of extra data while scanning through the input. If you consider a linear-time algorithm
that follows the divide-and-conquer paradigm, you either need to store more information or you have to scan through your data as many times as the depth of the recursion.

One example is the DC3 algorithm for constructing the suffix array of a text $T$ (given as an array in the RAM model). You group the characters into triplets, so you get a text over new super-characters. You can do this with an offset of $0,1,2$, which results in three new texts $T_1,T_2,T_3$. Interestingly, given the suffix array of $T_1\cdot T_2$, you can compute the suffix array of $T$ in linear time. Hence the algorithm needs

$$ t(n) = t(2n/3) + O(n) $$

time. This recurrence clearly solves to $t(n)=O(n)$. I don't see how this could be turned into a streaming algorithm.
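Unrolling the recurrence shows why the total work stays linear: the per-level costs form a geometric series,

$$ t(n) \le \sum_{i=0}^{\infty} \left(\tfrac{2}{3}\right)^i cn = 3cn = O(n). $$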

Another well known example is the classical linear-time selection algorithm.

A.Schulz

I interpret your question as follows. Let's fix some computational problem $P$. We define:

  • $R(P)$ is the smallest workspace achievable by a linear-time random-access algorithm for $P$. I think the exact model does not matter all that much, but let's say that we have a word RAM which is given the input as a random-access read-only array.
  • $S(P)$ is the smallest workspace achievable by a sequential algorithm for $P$; here we assume that the algorithm (which is again modeled as a word RAM machine) proceeds in time steps: at each time step one cell of the input array is given, the algorithm does some processing, records some information in its local storage, and then proceeds to the next time step. The array is "looped over" a constant number of times in this manner.

So I think you are asking about how big the gap between $R(P)$ and $S(P)$ can be.

On the low end of the spectrum, there is an answer in Muthu's streaming book. Look at puzzle 3. The problem is: given an array of $n$ integers, all in the range $[1, n-1]$, find a duplicated integer. There is a random-access linear-time solution using $O(\log n)$ bits of workspace (equivalently, $O(1)$ words): basically pointer chasing. But a constant-pass streaming algorithm must necessarily have $\omega(\log n)$ space complexity.
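The pointer-chasing idea is, in essence, Floyd's cycle finding applied to the map $i \mapsto A[i]$; here is a Python sketch (my reconstruction, assuming that reading of the puzzle):

```python
def find_duplicate(a):
    """Find a repeated value in a length-n array of integers from [1, n-1].

    View the array as a function i -> a[i]. Since no value is 0, index 0
    lies outside every cycle, so the entry point of the cycle reachable
    from 0 is a duplicated value. O(n) time, O(1) words of workspace --
    but it jumps around the array, so it is not a streaming pass.
    """
    slow = fast = 0
    while True:                 # phase 1: find a meeting point in the cycle
        slow = a[slow]
        fast = a[a[fast]]
        if slow == fast:
            break
    slow = 0                    # phase 2: walk to the cycle entry
    while slow != fast:
        slow = a[slow]
        fast = a[fast]
    return slow
```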

There is an aspect of the model that we have not quite fixed: what direction do the passes go? I.e., once we feed the streaming algorithm the entire array, does the next pass start from the end or the beginning? Interestingly, this can make a big difference. Consider the problem of recognizing a well-parenthesized expression with two types of parentheses. Say we want the probability of error to be $O(1/\log^2 n)$. Then Chakrabarti et al. show that if we restrict all passes to go in one direction, we have $ps = \Omega(\sqrt{n})$, where $p$ is the number of passes and $s$ is the space complexity. On the other hand, Magniez et al. give a simple algorithm that uses $O(\log^2 n)$ space, has polynomially small probability of a false positive, and makes one pass forward and one backwards.

Sasho Nikolov
  • 2,587
  • 17
  • 20

Even under the simplest definition of a "streaming algorithm" (an algorithm which, after each incremental read of the source, immediately yields the next incremental piece of the result), I can think of a few linear algorithms that don't behave that way. Hashing algorithms are a big one; FNV-1a is linear in the number of bytes of the source, but we don't know any part of the final hash until the full source has been processed.
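For concreteness, here is 32-bit FNV-1a in Python (the constants are the published offset basis and prime); every byte perturbs the entire 32-bit state, so no part of the final digest is fixed early:

```python
def fnv1a_32(data):
    """32-bit FNV-1a hash of a bytes object."""
    h = 0x811C9DC5                          # FNV-1a 32-bit offset basis
    for b in data:
        h ^= b                              # fold the byte in first...
        h = (h * 0x01000193) & 0xFFFFFFFF   # ...then multiply by the FNV prime
    return h
```

Note that the *state* is updated incrementally (it is a fold), but the intermediate values are not a prefix of the answer, which is the sense in which the answer above says it does not stream its output.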

RadixSort is O(N) (technically O(N log M), where M is the maximum value among the N items, which is assumed to be small), and must run in its entirety to guarantee that any individual item is in its final place.
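An LSD radix sort sketch in Python (my illustration): it makes one stable pass per byte of the maximum value, and no element's final position is known until the last pass completes.

```python
def radix_sort(items):
    """Sort non-negative ints: one stable bucket pass per byte (base 256)."""
    if not items:
        return items
    shift, m = 0, max(items)
    while (m >> shift) > 0:
        buckets = [[] for _ in range(256)]
        for x in items:                       # stable pass on the current byte
            buckets[(x >> shift) & 0xFF].append(x)
        items = [x for b in buckets for x in b]
        shift += 8
    return items
```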

To be a "streaming" algorithm, at its simplest, an algorithm must have the following two properties, neither of which is expressly time-bound:

  • Space complexity better than O(N) (stated equivalently, neither the entire source nor the entire result has to be held in memory)
  • An O(N) I/O relationship (the algorithm produces a number of outputs linearly proportional to its inputs)

Therefore, the main class of algorithms that stream is that of algorithms performing "projections": incremental transformations of each input into X > 0 outputs.
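As an illustration of such a projection (my example, not from the answer), a running mean emits one output per input while keeping O(1) state:

```python
def running_mean(stream):
    """One-pass streaming projection: constant state, one output per input."""
    total = count = 0
    for x in stream:
        total += x
        count += 1
        yield total / count   # emitted immediately after each input
```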

KeithS