Is there an (efficient?) algorithm which, given a context-free language $L$ (given as a grammar) and a string $x$ with $x \notin L$, computes a $y \in L$ such that $\forall y' \in L: d(x, y) \le d(x, y')$, where $d$ is the Levenshtein distance? (Secondarily, can one enumerate all such $y$?)
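To make the objective concrete, here is a minimal Python sketch. It assumes unit-cost Levenshtein distance and a finite list of candidate strings standing in for $L$; the names `levenshtein` and `nearest` are purely illustrative. It is brute force and only restates the definition. A real $L$ is generally infinite, so this is not the algorithm being asked for.

```python
from typing import Iterable, List


def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance (insert, delete, substitute) via row-by-row DP."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        cur = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or free match
            ))
        prev = cur
    return prev[-1]


def nearest(x: str, candidates: Iterable[str]) -> List[str]:
    """All candidates y minimizing d(x, y). ASSUMPTION: candidates is finite."""
    cands = list(candidates)
    if not cands:
        return []
    best = min(levenshtein(x, y) for y in cands)
    return [y for y in cands if levenshtein(x, y) == best]


# Hypothetical finite sample of the balanced-parentheses language.
sample = ["", "()", "()()", "(())"]
print(nearest("(()", sample))  # ['()', '()()', '(())'], each at distance 1
```

Note that even in this tiny sample the minimizer is not unique, which is why enumerating all optimal $y$ is a separate question.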
The motivation is to give more useful parse error messages: if you give me an almost-valid C program, I would like to tell you "if you delete these parentheses and insert a semicolon here, it'll parse".
A family of related questions:
- What happens when you vary the distance function? (Can you think of other candidates? Do non-metric distance functions make sense? Do they change the answer?)
- Does the answer change if you restrict $y$, e.g. so that:
  - $x$ is a prefix of $y$ (i.e. you can do append-only error-correction)
  - $y$ is a prefix of $x$ (i.e. you can do pop-last-only error-correction)
  - all the characters of $x$ occur in $y$ and in the same relative order, but maybe not adjacently (i.e. you can error-correct by inserting anywhere)
  - [...] $y$ occur in $x$ [...] by deleting anywhere.
- Does the answer change if you change the optimization criterion? E.g. "find a $y$ in $L$ which maximizes the length of common_prefix(x, y)", or something else?
- Does the answer change for other language classes, e.g. regular?
- Does the answer change if the language is given by an $LALR(k)$, $LL(k)$, deterministic, or other sub-context-free grammar?
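For reference, the four restrictions on $y$ listed above (append-only, pop-last-only, insert-anywhere, delete-anywhere) can each be stated as a simple predicate on the pair $(x, y)$. The sketch below is only meant to pin down what each restriction means; the names `is_prefix`, `is_subsequence`, and `RESTRICTIONS` are hypothetical, and nothing here addresses how to search $L$ under these constraints.

```python
from typing import Callable, Dict


def is_prefix(p: str, s: str) -> bool:
    """True if p is a prefix of s."""
    return s.startswith(p)


def is_subsequence(s: str, t: str) -> bool:
    """True if the characters of s occur in t in the same relative order."""
    it = iter(t)
    # `ch in it` advances the iterator, so matches must appear left to right.
    return all(ch in it for ch in s)


# Each restriction as a predicate allowed(x, y) on input x and correction y.
RESTRICTIONS: Dict[str, Callable[[str, str], bool]] = {
    "append_only":     lambda x, y: is_prefix(x, y),       # x is a prefix of y
    "pop_last_only":   lambda x, y: is_prefix(y, x),       # y is a prefix of x
    "insert_anywhere": lambda x, y: is_subsequence(x, y),  # x embeds in y
    "delete_anywhere": lambda x, y: is_subsequence(y, x),  # y embeds in x
}

# Example: "(()" can be repaired to "(())" by appending a single ")".
print(RESTRICTIONS["append_only"]("(()", "(())"))
```

Under the insert-only or delete-only restrictions, the Levenshtein distance between $x$ and an allowed $y$ is just the difference of their lengths, so those variants amount to finding the shortest supersequence of $x$ in $L$ or the longest subsequence of $x$ in $L$.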