
I recently worked on an algorithm which, among other things, checks strings for equality using the classic builtin equality operator:

str1 == str2

(I think it should be irrelevant to the question, but I faced this issue in C++, and str1 and str2 are obviously std::strings.)

At a later point I decided to relax the equality condition between strings such that it actually does

trimspace(str1) == trimspace(str2)

where trimspace removes leading and trailing whitespace.
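For concreteness, here is a minimal sketch of what trimspace does (assuming std::isspace defines what counts as whitespace; the actual implementation may differ):

```cpp
#include <cctype>
#include <string>

// Minimal trimspace sketch: drop leading and trailing whitespace.
// The cast to unsigned char avoids undefined behaviour for negative chars.
std::string trimspace(const std::string& s) {
    std::size_t first = 0;
    while (first < s.size() && std::isspace(static_cast<unsigned char>(s[first])))
        ++first;
    std::size_t last = s.size();
    while (last > first && std::isspace(static_cast<unsigned char>(s[last - 1])))
        --last;
    return s.substr(first, last - first);
}
```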

I innocently assumed that the complexity (at least in the worst-case scenario of equal strings) would stay the same: just as the original code has to check every corresponding character of the two strings for equality, the new code does the same, only after skipping any leading or trailing whitespace, and that skipping also visits each space character only once.

In reality I observed an enormous slowdown.

Stepping through the code with a debugger, I eventually realized that the str1 == str2 call was translated to a single call to memcmp, the builtin that compares two char arrays, whereas trimspace(str1) == trimspace(str2) resulted in as many calls to memchr as the total number of leading and trailing whitespace characters in the two strings, plus 4, plus one call to memcmp to compare the trimmed strings.¹

This is clearly the source of the slowdown I mentioned, I guess because of the overhead of all those function calls.

I see that the fundamental problem is that, although both expressions take O(N),

  • the first one, str1 == str2, requires all characters to be compared, but not in any particular order: they can all be compared at the same time, in parallel, so the CPU can take advantage of that;
  • the second one, trimspace(str1) == trimspace(str2), requires traversing the strings in order (from left to right and from right to left) to trim the leading and trailing spaces one by one.

How is this difference between the two computations above formalized in the computational complexity theory?


¹ For example, to compare `   hello  ` (3 spaces before and 2 spaces after the word hello) with itself, the 3 leading and 2 trailing characters are successfully compared for equality with the space character, and the h and the o are compared unsuccessfully, which terminates the trimming; this happens for both copies of the string, resulting in 14 characters being compared for equality with the space character, each comparison happening via a call to memchr; then hello is compared with its identical counterpart via one single call to memcmp.


I initially posted this question on stackoverflow; you can see it (deleted) here.

Enlico

3 Answers


From the viewpoint of complexity theory, comparing two strings, with or without surrounding whitespace, takes time linear in the lengths of the strings, because there is no computational model, practical or (as far as I know) theoretical, in which parallel computation is unlimited. A real-world machine could compare strings up to a certain length in parallel, which would reduce the constant factor of the total comparison by the degree of parallelisation, but the constant will not reach 0, and the computation remains asymptotically linear.

In practice, there are a lot of possible optimisations which could make a huge difference, at least if it's unlikely that the strings being compared are equal. That's not very relevant to complexity theory, but to illustrate a few:

  • If the representation of strings includes their length, then you could start str1 == str2 by comparing the lengths, in O(1) time. Depending on the frequency distribution of string lengths, this might result in sublinear average time. But it won't work for trimmed strings, since you have to trim the strings before comparing the lengths.

  • Similarly, you can stop the scan as soon as you find a pair of differing characters. But in the case of trimmed strings, your algorithm starts by trimming, which takes time linear in the amount of surrounding whitespace. Again, depending on the frequency distribution of the arguments to the function, this can have a large effect. If, for example, strings are uniformly distributed over a large alphabet, then the first difference will be found in constant time, on average. (However, comparing two equal strings will result in worst-case (linear) performance.) But for the trimmed strings, the work involved in trimming will depend on the frequency distribution of surrounding whitespace, which could have much more impact than the cost of comparing the first few characters.
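To illustrate the first point, here is a sketch of an equality check with the O(1) length pre-check; since std::string stores its length, this is essentially what a library operator== can do (but cannot do for untrimmed strings that should compare equal after trimming):

```cpp
#include <cstring>
#include <string>

// Equality with an O(1) length pre-check: strings of unequal length are
// rejected without looking at a single character.
bool fast_equal(const std::string& a, const std::string& b) {
    if (a.size() != b.size()) return false;                 // O(1) rejection
    return std::memcmp(a.data(), b.data(), a.size()) == 0;  // linear only if lengths match
}
```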

In practical terms, writing trimspace(str1) == trimspace(str2) is certainly the clearest and probably advisable in many cases, but in performance critical code you would be well-advised to find an algorithm with better average-time performance. For example, many languages would force the creation of two new string objects in order to store the results of the trimspace calls; copying objects is not free (indeed, it takes linear time), and the copy is completely unnecessary.

Similarly, it's not necessary to trim the ends of the strings unless the strings have matched equal up to that point. So a better algorithm might be:

  • Find $i_1$ and $i_2$, the indices of the first non-whitespace characters in $s_1$ and $s_2$.
  • While $s_1[i_1] = s_2[i_2]$, increment both $i_1$ and $i_2$.
  • If, while performing the above test, one of the indices reaches the length of its corresponding string, check whether all the remaining characters (if any) in the other string are whitespace. If a non-whitespace character is encountered, return False; otherwise, return True.

That will still take $O(n+m)$ in the worst case (which is when the strings are equal), but if equality is not a common case, it's likely to be sublinear on average.
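A sketch of that algorithm in C++, using std::isspace to decide what counts as whitespace (an assumption; substitute your own predicate). Note that it makes no copies and only touches the trailing whitespace when one string has been exhausted:

```cpp
#include <cctype>
#include <string>

// Trim-aware equality without allocating trimmed copies.
bool equal_trimmed(const std::string& s1, const std::string& s2) {
    auto ws = [](char c) { return std::isspace(static_cast<unsigned char>(c)) != 0; };
    std::size_t i1 = 0, i2 = 0;

    // Find the first non-whitespace character in each string.
    while (i1 < s1.size() && ws(s1[i1])) ++i1;
    while (i2 < s2.size() && ws(s2[i2])) ++i2;

    // Advance while the characters match.
    while (i1 < s1.size() && i2 < s2.size() && s1[i1] == s2[i2]) {
        ++i1;
        ++i2;
    }

    // Any remaining characters in either string must all be whitespace.
    while (i1 < s1.size() && ws(s1[i1])) ++i1;
    while (i2 < s2.size() && ws(s2[i2])) ++i2;
    return i1 == s1.size() && i2 == s2.size();
}
```

On a mismatch between non-whitespace characters, the final whitespace scans stop immediately and the function returns False, so the common unequal case stays cheap.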

rici

If the m-th of n characters is different, comparison takes O(m) time to find that the strings are different. If the strings were preceded by k and k' whitespace characters, you could compare the unequal strings in O(k + k' + m).

If you call a function first that trims the strings, then it depends on the implementation. Some implementations may return the same string in O(1) if there is neither leading nor trailing whitespace. Some implementations may be able to return objects representing a substring of another string, in O(l + t) if there are l leading and t trailing whitespace characters. Other implementations will return a copy of the relevant characters, typically in O(n).
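For example, in C++17 a trim that returns a std::string_view gives you the "object representing a substring of another string" behaviour: O(l + t) work and no copy (a sketch; the caller must keep the underlying string alive while the view is in use):

```cpp
#include <cctype>
#include <string_view>

// Trim by shrinking a view into the caller's buffer: O(l + t) for
// l leading and t trailing whitespace characters, and no allocation.
std::string_view trimspace_view(std::string_view s) {
    while (!s.empty() && std::isspace(static_cast<unsigned char>(s.front())))
        s.remove_prefix(1);
    while (!s.empty() && std::isspace(static_cast<unsigned char>(s.back())))
        s.remove_suffix(1);
    return s;
}
```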

gnasher729

Analysing the complexity: deciding whether two strings are equal can be done in linear sequential time by comparing them character by character, so the runtime is in $O(n)$ for strings of length $n$. The comparison can be done with memcmp, but from the complexity point of view that is just a detail.

Asking whether two strings are equal when you ignore leading and trailing whitespace is a completely different problem. One way to solve it is to trim both strings first, but the complexity of the trimming itself depends on the number of characters considered "whitespace". In the worst case, with a string of length $n$ and $k$ possible whitespace characters, trimming can take $\Theta(kn)$: you have to check every character up to and including the first non-whitespace one against (in the worst case) all $k$ whitespace characters, and do the same from the end.

That's your "number of whitespaces + 4" calls to memchr for trimming the two strings. Taken together they're not constant time, but linear in the number of whitespace characters.

So your solution to check whether two strings are equal ignoring leading and trailing whitespaces has complexity $O(kn)$.

Theoretically speaking, this makes no difference, since $k$ is a small constant, but practically, trimming a string is more work than comparing it to another string.

Bastian J