8

We know that regular expressions (RE) are implemented with finite automata (FA). In some language (like JavaScript) in RE there are features like 'capturing parenthesis' with 'backreferences':

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions#special-capturing-parentheses

(x) Matches 'x' and remembers the match, as the following example shows. The parentheses are called capturing parentheses. The '(foo)' and '(bar)' in the pattern /(foo) (bar) \1 \2/ match and remember the first two words in the string "foo bar foo bar". The \1 and \2 in the pattern match the string's last two words.

I want to know if this pattern /(foo) (bar) \1 \2/ is in fact a RE according the definition of RE that we have in theoretical formal language or it is something more powerful. And if it is so I would like to know if this kind of feature is implemented also with FA or in another way (in particualr how it is implemented).

asv
  • 345
  • 2
  • 14

4 Answers4

9

These extended notions of regular expressions capture more than just the regular languages. For example, ([ab]*)\1 matches the language $\{ww\mid w\in\{a,b\}^*\}$, which isn't regular and isn't even context-free (Example 2.38 of Sipser, Introduction to the Theory of Computation, 3rd edition).

"Regular" expressions that don't match regular langauges can't be translated to finite automata, since finite automata match only the regular languages. A side-effect of this is that many libraries don't even try to compile to automata, which can lead to extremely slow matching, even when a "regular" expression is a true regular expression. Russ Cox wrote an excellent article about this, which goes into a lot of the history, too.

David Richerby
  • 82,470
  • 26
  • 145
  • 239
8

The RE in the Automata Theory are equivalent to FA, but for the programming languages (regexp) this is no longer true.

The regular expressions in the programming languages (like PCRE) are far more powerfull than Regular Expressions (type 3) in the Automata Theory.

The matching parenthesis is neither regular nor context-free, it is a context-sensitive feature. But the RegExp from the question does not fully support Type 2 or Type 1.

The bracket matching is not implemented via FA. In case of PCRE it is implemented by guessing and backtracking.

Take a look at Perl Monks description about PCRE.

Evil
  • 9,525
  • 11
  • 32
  • 53
8

The answers are probably answering what you're intending to ask, but not what you're asking.

I want to know if this pattern /(foo) (bar) \1 \2/ is in fact a RE according the definition of RE that we have in theoretical formal language or it is something more powerful. And if it is so I would like to know if this kind of feature is implemented also with FA or in another way (in particualr how it is implemented).

In fact this is a regular expression that can be implemented with a finite automaton, because \1 is guaranteed to evaluate to foo and \2 is guaranteed to evaluate to bar.

Therefore a regex engine could use this fact to actually create a finite automaton that exactly describes the language you proposed.

However, if you make any captures conditional, then this may become false, as others have mentioned.

(Note that I say you may have trouble, because a language like /(a(aa|aa)|(aa|aa)a)\1\2/ can still be described via a FA. I only gave you a necessary condition, not a sufficient one. Edit: It just occurred to me that having a conditional is neither necessary nor sufficient, as /(a*)\1/ can also be turned into a finite automaton whereas /(ab*)\1/ cannot. So I guess it was just a rule of thumb.)

user541686
  • 1,187
  • 1
  • 10
  • 17
0

Certain regex implementation does not build a DFA. For example, java.util.regex OpenJDK implementation does not. As a result, its matching time is slower than DFA compiled implementation like dk.brics.automaton. Yet the later does not support capturing group as a result of underlying implementation.

vuamitom
  • 1
  • 1