What is the SetHorspool string searching algorithm, and is there pseudocode for it so that it can be easily implemented in a language of choice?

This has been implemented in 2 libraries I have come across:

But there seems to be very little detail about the algorithm available online to help understand how it works.

1 Answer

Reading the source code we can see that it is an extension of Horspool to find a match from a set of needle strings $S$, rather than matching a single needle string.

It works by representing the set $S$ as a trie. This allows you to efficiently check, character by character, whether any of the strings in the trie can be found in the haystack starting at position $i$. And, as in Boyer–Moore–Horspool, if position $i$ of the haystack fails to match any of the strings in $S$, a smarter skip lookup table $T$ is used instead of simply trying $i' = i+1$: $i' = i + T[\text{haystack}[i + |\text{needle}| - 1]]$.
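The trie check described above can be sketched in Python as follows. This is only an illustration of the idea, not code from either library: the names `build_trie` and `matches_at` are made up here, the trie is a plain nested `dict`, and the key `None` is an assumed convention for marking the end of a needle. It shows the per-position check only, without the skip table.

```python
def build_trie(needles):
    """Build a trie as nested dicts; the key None marks the end of a needle."""
    root = {}
    for needle in needles:
        node = root
        for ch in needle:
            node = node.setdefault(ch, {})
        node[None] = needle  # terminal marker, stores the complete needle
    return root


def matches_at(trie, haystack, i):
    """Return every needle that occurs in haystack starting at index i."""
    found = []
    node = trie
    while True:
        if None in node:
            found.append(node[None])
        # Stop when the haystack ends or no needle continues with this character.
        if i >= len(haystack) or haystack[i] not in node:
            return found
        node = node[haystack[i]]
        i += 1
```

For example, with the needles `"he"`, `"hers"` and `"she"`, calling `matches_at(trie, "ushers", 2)` walks `h`, `e`, `r`, `s` once and reports both `"he"` and `"hers"`.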

The second difference is how $T$ is computed. For a single pattern, the pseudocode is as follows (from the Wikipedia article):

function preprocess(pattern)
    T ← new table of |Σ| integers
    for i from 0 to |Σ| exclusive
        T[i] ← length(pattern)
    for i from 0 to length(pattern) - 1 exclusive
        T[pattern[i]] ← length(pattern) - 1 - i
    return T
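As a rough Python rendering of the single-pattern pseudocode above, here is a minimal sketch. It makes one deliberate substitution: instead of a table of $|\Sigma|$ integers it uses a `dict` and falls back to the pattern length for characters that are not in it, which should be equivalent. The function name `horspool_search` is chosen here for illustration.

```python
def preprocess(pattern):
    """Skip table: distance from the last occurrence of each character
    (ignoring the final position) to the end of the pattern."""
    m = len(pattern)
    table = {}
    for i in range(m - 1):
        table[pattern[i]] = m - 1 - i
    return table


def horspool_search(haystack, pattern):
    """Return the index of the first occurrence of pattern, or -1."""
    m = len(pattern)
    if m == 0:
        return 0
    table = preprocess(pattern)
    i = 0
    while i + m <= len(haystack):
        if haystack[i:i + m] == pattern:
            return i
        # Skip based on the haystack character under the last pattern position;
        # characters absent from the table allow a full-length skip.
        i += table.get(haystack[i + m - 1], m)
    return -1
```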

But in SetHorspool, the minimum safe skip considering all patterns is computed instead:

function preprocess(S)
    T ← new table of |Σ| integers
    for i from 0 to |Σ| exclusive
        T[i] ← min(length(pattern) : pattern ∈ S)
    for pattern in S
        for i from 0 to length(pattern) - 1 exclusive
            T[pattern[i]] ← min(T[pattern[i]], length(pattern) - 1 - i)
    return T
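Putting the pieces together, a complete search might look like the Python sketch below. Two things in it are assumptions rather than something confirmed from the libraries' source. First, the table above measures distances to the last character of each pattern, so the sketch aligns all patterns on their last character and walks a trie of the reversed patterns backwards from the window end; this is how Set Horspool is usually presented in textbooks, and it keeps the skips safe when the patterns have different lengths. Second, a `dict` with a default of the shortest pattern length stands in for the $|\Sigma|$-sized table. The names `build_reversed_trie` and `set_horspool_search` are made up for this example.

```python
def preprocess(patterns):
    """Minimum safe skip over all patterns; the default skip is the
    length of the shortest pattern."""
    lmin = min(len(p) for p in patterns)
    table = {}
    for p in patterns:
        for i in range(len(p) - 1):
            table[p[i]] = min(table.get(p[i], lmin), len(p) - 1 - i)
    return lmin, table


def build_reversed_trie(patterns):
    """Trie of the reversed patterns (nested dicts, None marks a pattern end)."""
    root = {}
    for p in patterns:
        node = root
        for ch in reversed(p):
            node = node.setdefault(ch, {})
        node[None] = p
    return root


def set_horspool_search(haystack, patterns):
    """Return a sorted list of (start_index, pattern) for all occurrences."""
    if not patterns or any(len(p) == 0 for p in patterns):
        raise ValueError("patterns must be a non-empty set of non-empty strings")
    lmin, table = preprocess(patterns)
    trie = build_reversed_trie(patterns)
    matches = []
    end = lmin - 1  # haystack index aligned with the last pattern character
    while end < len(haystack):
        node, j = trie, end
        # Walk backwards from the window end through the reversed trie.
        while j >= 0 and haystack[j] in node:
            node = node[haystack[j]]
            if None in node:
                matches.append((j, node[None]))
            j -= 1
        end += table.get(haystack[end], lmin)
    return sorted(matches)
```

For instance, `set_horspool_search("ushers", ["he", "she", "hers"])` inspects only the window ends 1, 3 and 5 and reports `"she"` at index 1 and both `"he"` and `"hers"` at index 2.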
orlp