
According to this German Wikipedia article, the time required to merge relations $R$ and $S$ is $\in \mathcal{O}(|R| + |S|)$ if both relations are already sorted. [Note: You don't really need to read the text and the link jumps right to where the time complexity is stated. But I added a translation of that section to this question for some clarity.]

Assume $R = S$ where $R$ has two columns: one holds a random number and the other is 5 in every row. Join $R$ and $S$ on the column that is 5 in every row. The output then contains $|R| \cdot |S|$ tuples, so its space complexity is $\in \Theta(|R| \cdot |S|)$, and time complexity is always $\in \Omega(\text{spaceComplexity})$.

How can the time complexity stated by the Wikipedia article be true?
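To make the degenerate case concrete, here is a tiny Python sketch (the values are made up) showing that a join on a constant column produces $|R| \cdot |S|$ output tuples:

```python
# Hypothetical toy relation matching the setup above: each tuple is
# (random_value, constant_key), with the join key equal to 5 in every row.
R = [(10, 5), (3, 5), (7, 5)]
S = R

# Joining on the constant column pairs every tuple of R with every tuple of S.
output = [(r, s) for r in R for s in S if r[1] == s[1]]
print(len(output))  # |R| * |S| = 9 rows from 3-row inputs
```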


Here is the "Sort-Merge Join" section translated to English. Sorry, no markdown quote; everything from here on is the quote. The translation is deliberately literal in places (not how you would phrase it in idiomatic English) to preserve the meaning as well as possible. I marked my own inline comments with ///.

Quote of https://de.wikipedia.org/wiki/Joinalgorithmen#Sort-Merge_Join

Sort-Merge Join

Both relations get sorted by their join attributes. The result can be determined via a single scan through both sorted relations.

The algorithm is only suited for natural join and equi-join.

Pseudocode

Implementation of $R\bowtie_{R.a=S.a} S$ in pseudocode:

p_r := first tuple in R
p_s := first tuple in S
while(p_r != endof_r && p_s != endof_s)
    // Collect all tuples in S with the same join attributes.
    M_s := set with contents p_s /// Yes, it says "with", not "of".

    foreach(t_s in S > p_s)
        if(t_s.a = p_s.a)
            M_s += set with contents t_s
        elseif /// I think they mean "else".
            p_s := t_s
            break foreach
        endif
    endforeach

    // Search suitable start tuple in R. /// "passend" can also be translated "fitting" or "matching", not just "suitable".
    foreach(t_r in R > p_r)
        p_r = t_r
        if(t_r.a >= t_s.a)
            break foreach
        endif
    endforeach

    // Output tuples.
    foreach(t_r in R > p_r)
        if(t_r.a > t_s.a)
            break foreach
        endif

        foreach(t_s in M_s)
            Write output: (t_r, t_s)
        endforeach
        p_r = t_r
    endforeach
endwhile

Evaluation

Sorting can be done with effort $\mathcal{O}(|R|\log|R|+|S|\log|S|)$. The number of block accesses to sort $S$ is $b_s\left(2\log\frac{b_s}{b_{free}}\right)+b_s$ in the worst case, analogous for $R$.

A merge of both relations after sorting them costs $\mathcal{O}(|R|+|S|)$. In the best case, i.e. when the relations are already sorted, the cost of merging is the only cost.

In the normal case, the total costs are $\mathcal{O}(n\log n)$.

Variants

[Not translated because it doesn't seem to be important for the question.]
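[End of quote.] For concreteness, here is a rough Python rendering of the quoted pseudocode (my own sketch, not from the article), reading the stray `elseif` as `else` and assuming both inputs are already sorted by the join key:

```python
def sort_merge_join(R, S, key):
    """Sketch of the quoted sort-merge join.

    R and S must already be sorted by `key`.
    Returns all pairs (r, s) with key(r) == key(s).
    """
    out = []
    i = j = 0
    while i < len(R) and j < len(S):
        if key(R[i]) < key(S[j]):
            i += 1
        elif key(R[i]) > key(S[j]):
            j += 1
        else:
            # Collect the run of equal keys in S (the set M_s in the pseudocode).
            k = key(S[j])
            j_end = j
            while j_end < len(S) and key(S[j_end]) == k:
                j_end += 1
            # Pair every matching tuple of R with every tuple in that run.
            while i < len(R) and key(R[i]) == k:
                for t_s in S[j:j_end]:
                    out.append((R[i], t_s))
                i += 1
            j = j_end
    return out
```

On inputs where every key is identical, the inner pairing loop alone performs $|R| \cdot |S|$ iterations, which is exactly the mismatch the question points out.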

Yuval Filmus

UTF-8
1 Answer


You are absolutely correct. Wikipedia has an error -- or perhaps, if we are feeling more charitable, we could call it an oversimplification.

It is not true that the running time is at most $O(|R|+|S|)$. For instance, if the value of attribute $a$ is 42 for all elements of $R$ and $S$, the join outputs $|R| \times |S|$ tuples. It is also easy to see that the pseudocode performs $|R| \times |S|$ iterations of the innermost loop (i.e., that many executions of the statement "Write output:"), so the running time is $\Theta(|R| \times |S|)$.

Here are some statements that are true:

  • If there are no repeated values of $a$ (i.e., each value for $a$ appears at most once in $R$ and at most once in $S$), then the running time for the merge is $O(|R|+|S|)$, and the size of the output is also $O(|R|+|S|)$.

  • If any value appears at most $c$ times in attribute $a$, where $c$ is a constant, then the running time for the merge is $O(|R|+|S|)$, and the size of the output is also $O(|R|+|S|)$.

  • If we count only the I/O complexity (the number of disk/cache transfer operations), and if memory is large enough to hold $|S|$ items, then the I/O complexity is $O(|R|+|S|)$... though the running time and the size of the output might be as large as $O(|R| \times |S|)$. More generally, the same statement is true if any value for attribute $a$ appears at most $k$ times in $S$, and main memory is large enough to hold at least $k$ items (here $k$ does not need to be a constant).
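The bullets above all come down to counting key multiplicities: the output size of an equi-join is the sum, over each distinct join value $v$, of (occurrences of $v$ in $R$) times (occurrences of $v$ in $S$). A small sketch of that count (my own illustration, with made-up inputs):

```python
from collections import Counter

def join_output_size(R_keys, S_keys):
    """Number of tuples an equi-join on these key multisets produces."""
    cr, cs = Counter(R_keys), Counter(S_keys)
    return sum(cr[v] * cs[v] for v in cr)

# No repeated keys: output size is linear in the inputs.
print(join_output_size([1, 2, 3], [2, 3, 4]))  # 2

# All keys equal (the question's scenario): output size is |R| * |S|.
print(join_output_size([5] * 4, [5] * 4))  # 16
```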

So Wikipedia's statement is misleading or wrong or (at best) over-simplified. Perhaps what they really meant was one of the bullet items above. Your understanding is absolutely correct.

D.W.