Find member of CFL that is Levenshtein-closest to non-member string

Question

Is there an (efficient?) algorithm which given a context-free language $L$ (given as a grammar) and a string $x$ with $x \not \in L$ computes a $y$ with $y \in L$ and $\forall y': y' \in L \implies d(x, y) \le d(x, y')$, where $d$ is the Levenshtein distance? (Secondarily, can one enumerate all such $y$?)

The motivation is to give more useful parse error messages: if you give me an almost-valid C program, I would like to tell you "if you delete these parentheses and insert a semicolon here, it'll parse".

A family of related questions:

- What happens when you vary the distance function? (Can you think of other candidates? Do non-metric distance functions make sense? Do they change the answer?)
- Does the answer change if you restrict $y$, e.g. to cases where
  - $x$ is a prefix of $y$ (i.e. you can do append-only error-correction)
  - $y$ is a prefix of $x$ (i.e. you can do pop-last-only error-correction)
  - all the characters of $x$ occur in $y$ and in the same relative order, but maybe not adjacently (i.e. you can error-correct by inserting anywhere)
  - [...] $y$ occur in $x$ [...] by deleting anywhere.
- Does the answer change if you change the optimization criterion? E.g. "find a $y$ in $L$ which maximizes the length of common_prefix(x, y)", or something else?
- Does the answer change for other language classes, e.g. regular?
- Does this change if the language is given as an $LALR(k)$, $LL(k)$, deterministic or other sub-CF grammar?

D.W. · Accepted Answer · 2016-07-03T21:08:30.667

Yes. This can be done, using Levenshtein automata.

Let $S_k = \{y \in \Sigma^* : d(x,y) \le k\}$, where $\Sigma$ is the alphabet of $L$. Then $S_k$ is regular, and one can construct a finite-state automaton for it, called a Levenshtein automaton. The intersection of a CFL and a regular language is again a CFL. Moreover, given a CFL, you can efficiently decide whether it is non-empty, and if it is, you can efficiently find an element of it.

So, we obtain the following simple algorithm:

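For $k := 1,2,3,\dots$, do:
- If $L \cap S_k$ is non-empty, find $y \in L \cap S_k$, output it, and stop.

Because the loop stops at the first $k$ for which $L \cap S_k$ is non-empty, the output satisfies $d(x,y) = k$ and no member of $L$ is closer to $x$. (This assumes $L$ itself is non-empty; otherwise the loop never terminates.)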

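This can be sped up by using binary search to find the smallest $k$ such that $L \cap S_k$ is non-empty. Binary search is valid because $S_k \subseteq S_{k+1}$, so non-emptiness is monotone in $k$; as an upper bound one can take $\max(|x|, |w|)$ for any $w \in L$, since $d(x, w) \le \max(|x|, |w|)$.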

The same approach can handle your proposed restrictions as well, as those simply amount to intersecting with another regular language. It also works if $L$ is regular, as any regular language is necessarily context-free.
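To make the loop in this answer concrete, here is a minimal Python sketch. It is an illustration under several assumptions, not a tuned implementation: the grammar is assumed to be in Chomsky normal form without an $\varepsilon$-rule, the Levenshtein automaton is the textbook NFA with states $(i, e)$ (characters of $x$ consumed, edits spent), and the intersection with the grammar is computed as a naive fixpoint over triples $(p, A, q)$ in the style of the Bar-Hillel construction. All function and variable names (`lev_step`, `member_within`, `closest_member`, the rule lists) are made up for this example.

```python
def lev_step(state, ch, x, k):
    """Automaton states reached from (i, e) by reading ch.

    i = characters of x consumed so far, e = edits spent so far.
    Deleting characters of x is an epsilon move, folded in here.
    """
    i, e = state
    n = len(x)
    out = set()
    for d in range(min(n - i, k - e) + 1):   # d = deletions before reading ch
        i2, e2 = i + d, e + d
        if i2 < n and x[i2] == ch:
            out.add((i2 + 1, e2))             # ch matches x[i2]
        if e2 < k:
            out.add((i2, e2 + 1))             # ch was inserted
            if i2 < n:
                out.add((i2 + 1, e2 + 1))     # ch replaces x[i2]
    return out


def member_within(terminal_rules, binary_rules, start, x, k):
    """Return some y in L with d(x, y) <= k, or None if there is none."""
    n = len(x)
    states = [(i, e) for i in range(n + 1) for e in range(k + 1)]
    # witness[(p, A, q)]: a string derivable from A that moves the automaton p -> q
    witness = {}
    for head, ch in terminal_rules:
        for p in states:
            for q in lev_step(p, ch, x, k):
                witness.setdefault((p, head, q), ch)
    changed = True
    while changed:                            # naive fixpoint, fine for small inputs
        changed = False
        items = list(witness.items())
        for head, left, right in binary_rules:
            for (p, b, r), w1 in items:
                if b != left:
                    continue
                for (r2, c, q), w2 in items:
                    if c == right and r2 == r and (p, head, q) not in witness:
                        witness[(p, head, q)] = w1 + w2
                        changed = True
    for (p, head, (i, e)), w in witness.items():
        # accept if the remaining n - i characters of x can still be deleted
        if p == (0, 0) and head == start and e + (n - i) <= k:
            return w
    return None


def closest_member(terminal_rules, binary_rules, start, x, max_k):
    """Try k = 0, 1, ..., max_k; return (k, y) for the first non-empty intersection."""
    for k in range(max_k + 1):
        y = member_within(terminal_rules, binary_rules, start, x, k)
        if y is not None:
            return k, y
    return None


# Hypothetical example grammar: non-empty balanced parentheses in CNF.
TERMINAL_RULES = [("L", "("), ("R", ")")]
BINARY_RULES = [("S", "S", "S"), ("S", "L", "T"), ("S", "L", "R"), ("T", "S", "R")]
```

For example, `closest_member(TERMINAL_RULES, BINARY_RULES, "S", "(()", 3)` returns `k = 1` together with one balanced string at distance 1 from `"(()"`. Which of the equally close strings comes back depends on iteration order; enumerating all of them would require walking every accepting triple rather than stopping at the first.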

score 1 · Answer 2 · answered Jul 03 '16 at 14:21

If we limit ourselves to those $y$ with $|y| \le |x|$ (or to any other finite candidate set) and a computable strict partial order on candidates, we can construct the (finite) DAG of the order on our $y$s and find the set of all nodes of in-degree 0, those being the optimal ones. Given this set, we can answer is-empty, enumerate-all and arbitrary-member queries.

This addresses the variants of the question where $y$ is a prefix of $x$ and where $y$ is a subsequence of $x$ (delete-anywhere), for any decidable language class (e.g. context-free and regular).

The downside is that for delete-anywhere, we iterate over a candidate set of up to $2^{|x|}-1$ strings (we can omit $x$ itself since it's not a member). For prefix-only $y$s there are only $|x|$ candidates, which is reasonable.
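A brute-force sketch of this enumeration in Python, assuming the language is available only as a membership predicate. The names `closest_by_deletion`, `closest_prefix` and the sample predicate `is_balanced` are hypothetical and exist only for this illustration. Candidates are tried in order of increasing number of deletions, which equals the Levenshtein distance when $y$ is a subsequence of $x$.

```python
from itertools import combinations


def closest_by_deletion(x, is_member):
    """Return (deletions, members) for the fewest deletions that yield a member."""
    n = len(x)
    for deletions in range(n + 1):
        found = set()
        # choose which positions of x to keep, preserving their order
        for keep in combinations(range(n), n - deletions):
            y = "".join(x[i] for i in keep)
            if is_member(y):
                found.add(y)
        if found:
            return deletions, found
    return None


def closest_prefix(x, is_member):
    """Return the longest prefix of x that is a member, or None."""
    for length in range(len(x), -1, -1):
        if is_member(x[:length]):
            return x[:length]
    return None


def is_balanced(s):
    """Sample membership test: balanced parentheses (the empty string counts)."""
    depth = 0
    for ch in s:
        depth += 1 if ch == "(" else -1
        if depth < 0:
            return False
    return depth == 0
```

With this predicate, `closest_by_deletion("())(", is_balanced)` needs two deletions and `closest_prefix("())(", is_balanced)` returns `"()"`. The delete-anywhere search is exponential in $|x|$ in the worst case, as noted above.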

score 1 · Answer 3 · edited Apr 13 '17 at 12:19

The language $E = \{x\} \cdot \Sigma^*$ is regular, as is $I = \Sigma^* \{x_1\} \Sigma^* \{x_2\} \Sigma^* \cdots \Sigma^* \{x_n\} \Sigma^*$. For any language class of $L$ for which we can compute the shortest member of $L \cap R$ for regular $R$, we can answer the insert-at-end and insert-anywhere optimization problem (note that shorter implies smaller edit distance when we have only done inserts).

This is the case for regular and context-free languages:

If we have a DFA for $L$ we can use the product construction to get the language of valid candidates $C \in \{L \cap I, L \cap E\}$ as appropriate. Given this we can test $C$ for emptiness and (if not) find a shortest path from the starting state to an accept state, giving us the desired $y$.
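For the regular case, the product construction and shortest-path search might look like the following Python sketch. It assumes the DFA for $L$ is given as a partial transition dictionary (missing entries mean a dead state), and it tracks membership in $I$ or $E$ with a single counter $j$, the number of characters of $x$ matched so far. Greedy matching suffices for $I$ because $x$ is a subsequence of $y$ exactly when a greedy left-to-right match succeeds. The names `track` and `shortest_insertion_repair` are hypothetical.

```python
from collections import deque


def track(j, ch, x, mode):
    """Advance the matcher for I (mode 'anywhere') or E (mode 'end')."""
    if j == len(x):
        return j                      # all of x matched; anything may follow
    if ch == x[j]:
        return j + 1
    return j if mode == "anywhere" else None   # None means dead for E


def shortest_insertion_repair(delta, start, accepting, alphabet, x, mode="anywhere"):
    """BFS over the product automaton; returns a shortest y, or None."""
    n = len(x)
    first = (start, 0)
    parent = {first: None}            # node -> (previous node, character read)
    queue = deque([first])
    while queue:
        node = queue.popleft()
        q, j = node
        if q in accepting and j == n:
            chars = []
            while parent[node] is not None:
                node, ch = parent[node]
                chars.append(ch)
            return "".join(reversed(chars))
        for ch in alphabet:
            q2 = delta.get((q, ch))
            j2 = track(j, ch, x, mode)
            if q2 is None or j2 is None:
                continue
            nxt = (q2, j2)
            if nxt not in parent:
                parent[nxt] = (node, ch)
                queue.append(nxt)
    return None


# Hypothetical example DFA for (ab)*: state 0 is the start and only accepting state.
AB_STAR = {(0, "a"): 1, (1, "b"): 0}
```

For instance, with $x = \texttt{aa}$ the insert-anywhere repair is `"abab"`, while the append-only variant has no solution because no string in $(ab)^*$ starts with `aa`.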

The intersection of a context-free and a regular language is context-free. We can compute the shortest string in a context-free language, see https://math.stackexchange.com/questions/606518/the-shortest-word-in-context-free-language. (My slight modification: for $k = 1, 2, \dots$: if $A \to w$ with $|w| \le k$, replace $A$ with $w$ in all right-hand sides. Once $S$ produces a string, that's your result.)
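One simple way to compute a shortest word is a fixpoint iteration that keeps, for each nonterminal, the shortest terminal string found so far. The Python sketch below is a plain Bellman-Ford-style formulation and not necessarily the exact procedure from the linked answer or the modification described above. It assumes rules are given as `(head, body)` pairs with `body` a tuple of symbols; the name `shortest_words` and the example grammar are hypothetical.

```python
def shortest_words(rules, nonterminals):
    """Map each productive nonterminal to one of its shortest terminal strings."""
    best = {}
    changed = True
    while changed:
        changed = False
        for head, body in rules:
            # skip rules whose body still contains an unresolved nonterminal
            if any(s in nonterminals and s not in best for s in body):
                continue
            candidate = "".join(best[s] if s in nonterminals else s for s in body)
            if head not in best or len(candidate) < len(best[head]):
                best[head] = candidate
                changed = True
    return best


# Hypothetical example: S -> A B, A -> a A | a, B -> b, X -> X a (X is unproductive).
RULES = [
    ("S", ("A", "B")),
    ("A", ("a", "A")),
    ("A", ("a",)),
    ("B", ("b",)),
    ("X", ("X", "a")),
]
NONTERMINALS = {"S", "A", "B", "X"}
```

The loop terminates because every update strictly shortens a stored string. A nonterminal that never receives an entry derives no terminal string, so the same pass doubles as the emptiness test: the language is empty exactly when the start symbol is missing from the result.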

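By computing $C_i = L \cap (\{x_1 \cdots x_i\} \cdot \Sigma^*)$ for $i = 0, \ldots, |x|$ and taking the largest $i$ for which $C_i$ is non-empty, we can find the $y \in L$ that share the longest prefix with $x$. (Note that $i = |x|$ is possible even though $x \notin L$, since $L$ may contain a proper extension of $x$.) Among them, we can also find the shortest by the above method.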
