15

Since I'm studying for my formal languages college course, I stumbled upon these fascinating posts (One Two) which describe how to find a prime number using a regexp. As I said, a regexp, not a regular expression. Since a regular expression can match strings computed by a Finite State Automata and finding a prime number can't be done by a FSA, the regexp shown in the blog post is not entirely a regular expression since it does backtracking to match the string.

Since I've never really used any regular expression, now, my question:

How can I immediately recognize a regexp from a "true" regular expression just by looking at it?

Definitions: By regular expression, I refer to the notion as defined in formal languages. By regexp, I mean the notion supported by modern programming languages; the regexp syntax often contains additional features, such as backreferences. Regexps as seen in programming languages are strictly more powerful than formal languages style regular expressions.

peperunas
  • 263
  • 1
  • 5

1 Answers1

18

tl;dr backrefs.

As soon as there is a \1 (or any number that isn't used to escape unicode) in the regexp it is not a regular expression.

Backrefs allows you to match (a+)b\1 which matches n times a followed by b followed by n times a for any n>1. This is not a regular language (it's the poster child of a non regular language).

It is necessary and nearly sufficient that the backref references a group that contains a regexp that matches an arbitrarily long string or that it contains a * or +. The only exception (that I found) of a regexp of the form (A)B\1 where A is a finite language (could be replaced by a enumeration of all words that accepts them). You can convert it to word1+Bword1|word2+Bword2 etc. because A is finite.

Look-around groups don't remove the regularness of the regexp. A(?=B)C is the cross-section of regexes AB.* and AC and the cross-section of 2 regular languages is regular. Negative lookahead is similar except using the complement of B.* (complements of regular languages being regular). Lookbehind is exactly the same as well A(?<=B)C is the cross-section of AC and .*BC.

ratchet freak
  • 4,696
  • 1
  • 18
  • 15