Why is using a lexer/parser on binary data so wrong?

Question

I often work with lexer/parsers, as opposed to parser combinators, and see people who never took a class in parsing ask about parsing binary data. Typically the data is not only binary but also context-sensitive. This basically leads to having only one type of token: a token for a byte.

Can someone explain why parsing binary data with a lexer/parser is so wrong, with enough clarity for a CS student who hasn't taken a parsing class but has some footing in theory?

score 11 · Accepted Answer · edited Mar 30 '12 at 15:42

In principle, there is nothing wrong.

In practice,

most non-textual data formats I know are not context-free and are therefore not suitable for common parser generators. The most common reason is that they have length fields giving the number of times a production has to be present.

Obviously, having a non-context-free language has never prevented the use of parser generators: we parse a superset of the language and then use semantic rules to reduce it to what we want. That approach could be used for non-textual formats if the result were deterministic. The problem is to find something other than counts to synchronize on, as most binary formats allow arbitrary data to be embedded; length fields tell you how much of it there is.

You can then start playing tricks like having a manually written lexer able to handle that with feedback from the parser (lex/yacc handling of C uses that kind of trick to handle typedef, for instance). But then we come to the second point.

Most non-textual data formats are quite simple (even if they are not context-free). When the counts mentioned above are ignored, the languages are regular, LL(1) at worst, and are thus well suited for manual parsing techniques. And handling counts is easy for manual parsing techniques like recursive descent.
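
To make the point about counts concrete, here is a minimal sketch of a hand-written recursive-descent parser in Python. The format is hypothetical and invented purely for illustration: a 1-byte record count, followed by that many records, each being a 2-byte big-endian length and then that many payload bytes. The names `parse_message` and `parse_record` are likewise assumptions, not part of any real library.

```python
import struct


def parse_record(buf, pos):
    """Parse one record of the hypothetical format starting at offset pos.

    A record is a 2-byte big-endian length followed by that many payload bytes.
    Returns the payload and the offset just past the record.
    """
    if pos + 2 > len(buf):
        raise ValueError("truncated length field")
    (length,) = struct.unpack_from(">H", buf, pos)
    pos += 2
    payload = buf[pos:pos + length]
    if len(payload) != length:
        raise ValueError("truncated payload")
    # The payload is skipped by count; its bytes are never inspected.
    return payload, pos + length


def parse_message(buf):
    """Parse a message: a 1-byte record count, then exactly that many records."""
    if not buf:
        raise ValueError("empty message")
    count = buf[0]
    pos = 1
    records = []
    # The count read from the data drives the loop. This is the
    # context-sensitive part that a context-free grammar cannot express.
    for _ in range(count):
        payload, pos = parse_record(buf, pos)
        records.append(payload)
    if pos != len(buf):
        raise ValueError("trailing bytes after last record")
    return records
```

Because the payload is consumed by length, it may contain bytes that look like headers without confusing the parser: `parse_message(b"\x01\x00\x02\x02\x00")` yields a single record whose payload is `b"\x02\x00"`.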

Alex ten Brink · Answer 2 · 2012-04-04T18:24:05.920

Let's categorize data into three categories: data readable by humans (usually texts, varying from books to programs), data intended to be read by computers, and other data (such as images or sound).

For the first category, we need to process them into something a computer can use. As the languages used by humans can generally be captured relatively well by parsers, we usually use parsers for this.

An example of data in the third category would be a scanned image of a page out of a book which you want to parse into text. For this category, you almost always need very specific knowledge about your input, and therefore you need a specific program to parse it. Standard parsing technology won't get you very far here.

Your question is about the second category: if we have data that is in binary, it is almost always a product of a computer program, intended for another computer program. This immediately also means that the format the data is in is chosen by the program responsible for its creation.

Computer programs almost always produce data in a format that has a clear structure. If we parse some input, we are essentially trying to figure out the structure of the input. With binary data, this structure is generally very simple and easy to parse by computers.

In other words, it's normally a bit of a waste to figure out the structure of an input for which you already know the structure. As parsing isn't free (it takes time and adds complexity to your program), this is why using lexers/parsers on binary data is 'so wrong'.

Gilles 'SO- stop being evil' · Answer 3 · 2012-04-04T18:24:18.923

If a language needs to be parsed in some non-trivial way, it usually means that structural elements need to be matched, so the input language contains redundancy, either because multiple inputs map to the same parse tree or because some input strings are invalid. Humans like redundancy. For example, most humans find binary operators more readable than a pure prefix or suffix notation for elementary arithmetic: $a + b \times (c - d) + e$ rather than (+ a (* b (- c d)) e) or a b c d - * + e +. The usual mathematical notation has more redundancy than Lisp (which requires more parentheses, but gets variable arities for free, so requires fewer symbols to express expressions using large arities) or RPL (which never needs parentheses). Such redundancy is rarely useful to computers — and where it is, which is when there may be errors in the data, the error correction logic is usually kept separate from the functional meaning of the data, for example using error correcting codes which apply to arbitrary byte sequences regardless of what they represent.
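
As a small illustration of how little machinery a non-redundant notation needs, the following Python sketch evaluates the postfix form above with a single stack and no grammar at all. The function name `eval_postfix` and the variable values are made up for the example.

```python
import operator

# Binary operators supported by this sketch.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def eval_postfix(tokens, env):
    """Evaluate a postfix expression given as a list of tokens.

    Operands are variable names looked up in env; operators are keys of OPS.
    """
    stack = []
    for tok in tokens:
        if tok in OPS:
            if len(stack) < 2:
                raise ValueError("operator %r lacks operands" % tok)
            right = stack.pop()
            left = stack.pop()
            stack.append(OPS[tok](left, right))
        else:
            stack.append(env[tok])
    if len(stack) != 1:
        raise ValueError("malformed postfix expression")
    return stack[0]


# Illustrative values only; a + b * (c - d) + e = 1 + 2 * (5 - 3) + 10
env = {"a": 1, "b": 2, "c": 5, "d": 3, "e": 10}
result = eval_postfix("a b c d - * + e +".split(), env)
```

There is nothing to match and no precedence to resolve: each token is handled the moment it is read, which is the property binary formats tend to aim for.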

Binary formats are usually designed to be compact, which means few simple language features such as balanced parentheses that are expressible by context-free grammars. Furthermore, it is often useful for binary representations of data to be canonical, i.e. to have a single representation of each object. This rules out sometimes-redundant features such as parentheses. Another, less commendable consequence of having less redundancy is that if every input is syntactically correct, it saves on the error checking.

Another factor against nontrivial parsers for binary data is that a lot of binary formats are designed to be parsed by low-level code that likes to operate in constant memory with little overhead. Fixed sizes are preferred, when applicable, over allowing arbitrary repetition of an element. A format such as TLV (type-length-value) allows a left-to-right parser to first allocate the right amount of memory for an object, then read the representation of the object. Parsing from left to right is an advantage because it allows the data to be processed as it comes, with no intermediate buffer.
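
A left-to-right TLV reader can be sketched in a few lines of Python. The field widths here (1-byte type, 2-byte big-endian length) are an assumption made for the example, as are the names `iter_tlv` and `read_exact`; real TLV formats such as ASN.1 BER use their own encodings.

```python
import io
import struct


def read_exact(stream, n):
    """Read exactly n bytes from a buffered binary stream or raise EOFError."""
    data = stream.read(n)
    if len(data) != n:
        raise EOFError("expected %d bytes, got %d" % (n, len(data)))
    return data


def iter_tlv(stream):
    """Yield (type, value) pairs from a stream of TLV items, left to right.

    Assumed layout per item: 1-byte type, 2-byte big-endian length, value.
    Assumes a buffered stream whose read(n) returns fewer than n bytes
    only at end of input.
    """
    while True:
        header = stream.read(3)
        if not header:
            return  # clean end of input between items
        if len(header) != 3:
            raise EOFError("truncated TLV header")
        tag, length = struct.unpack(">BH", header)
        # The length is known before the value is read, so exactly the
        # right amount of memory is requested for this item.
        yield tag, read_exact(stream, length)


items = list(iter_tlv(io.BytesIO(b"\x01\x00\x02hi\x02\x00\x00")))
```

Each item is handed to the caller as soon as it has been read, so only one value is held in memory at a time and unknown types can simply be skipped by their length.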
