What would a formal grammar for a binary file format look like?

Question

Binary structures often feature length specifiers: the parser is supposed to read them and then consume the specified number of symbols. Because of this, the grammar is context-sensitive.

What would a context-sensitive grammar for a simple binary format look like?

For example, let's consider a length-prefixed array with the following layout:

64 bits | [size * 8] bits
size    | data

I suppose the language that corresponds to this format would be:

$$n b^n \\ n \in [0, 2^{64}[ \\ b \in [0, 2^8[$$
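To make the layout concrete, here is a minimal hand-written sketch of a parser for it. The byte order (big-endian) and the names `encode_array` and `parse_array` are assumptions for illustration only; the layout above does not specify them.

```python
import struct

HEADER = struct.Struct(">Q")  # 64-bit unsigned size; big-endian is an assumption


def encode_array(data: bytes) -> bytes:
    """Prefix `data` with its length in octets, stored as a 64-bit integer."""
    return HEADER.pack(len(data)) + data


def parse_array(buf: bytes) -> bytes:
    """Read the size field, then consume exactly that many octets."""
    if len(buf) < HEADER.size:
        raise ValueError("input too short for the 64-bit size field")
    (size,) = HEADER.unpack_from(buf, 0)
    data = buf[HEADER.size:HEADER.size + size]
    # The value read from the input decides how much input must follow.
    if len(data) != size or len(buf) != HEADER.size + size:
        raise ValueError("data length does not match the size field")
    return data
```

The point of the sketch is the data dependency: `size` is part of the input, yet it controls how many octets the parser accepts next.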

Left-context-sensitive grammars have the following structure:

$$\alpha A \rightarrow \alpha \gamma$$

I don't understand how this formalism could generate or be used to parse a language which contains part of the grammar itself in it. Does the following grammar make any sense?

\begin{align} Bit & \rightarrow 0 \\ Bit & \rightarrow 1 \\ Octet & \rightarrow Bit^8 \\ Size & \rightarrow Octet^8 \\ Size \; Data & \rightarrow Size \; Octet^{Size} \\ Array_s & \rightarrow Size \; Data \\ \end{align}

I reason that the grammar above matches the structure of left-context-sensitive grammars. It seems clear to me that the next-to-last line of the grammar directly corresponds to the definition above:

\begin{align} \alpha & = Size \\ A & = Data \\ \gamma & = Octet^{Size} \\ \end{align}

The trickiest non-terminal seems to be $Size$. It is not clear to me exactly how it influences the grammar's semantics. It is simultaneously part of the input and the grammar, serving as the number of repetitions for $Octet$.

I have seen many grammar-based approaches to parsing binary file formats. They include formalisms such as attribute grammars,^[1]^[2] adaptive grammars, the recently introduced data-dependent grammars,^[1]^[2] parser combinators ^[1]^[2]^[3]^[4]^[5] and even scattered context grammars.^[1]

All these tools go beyond the definition above, so I keep wondering about the nature of binary file formats. If it is not possible to describe their language with a context-sensitive grammar, does that mean they are more powerful in the Chomsky hierarchy?

score 4 · Answer 1 · answered Aug 19 '16 at 15:04

The Chomsky hierarchy is a hierarchy of grammars described by a given formalism. The other grammar descriptions that you are giving (attribute grammars, ...) are not part of that formalism so they are outside the Chomsky hierarchy (and if I'm not mistaken are able to recognize some languages of higher level in the hierarchy while being unable to recognize other languages at a lower level).
Context-sensitive grammars are not at the top of the Chomsky hierarchy; unrestricted grammars are.
Context-sensitive grammars can generate what you are asking for, but they are painful to write. Here is a grammar which generates an $\text{s}$ followed by a binary number, followed by as many $\text{p}$'s as the binary number specifies.

\begin{align}
S & \rightarrow \text{s} \; C \\
\text{s} \; C & \rightarrow \text{s} \; A \\
\text{s} \; C & \rightarrow \text{s} \; B \; \text{p} \\
\text{s} \; C & \rightarrow \text{s} \; B \; C \; \text{p} \\
A \; C & \rightarrow A \; D \\
A \; D & \rightarrow B \; D \\
B \; D & \rightarrow B \; \text{p} \\
B \; D & \rightarrow B \; C \; \text{p} \\
B \; C & \rightarrow B \; E \\
B \; E & \rightarrow X \; E \\
X \; E & \rightarrow X \; \text{p} \\
X \; E & \rightarrow X \; C \; \text{p} \\
A \; X & \rightarrow A \; F \\
A \; F & \rightarrow B \; F \\
B \; F & \rightarrow B \; A \\
B \; X & \rightarrow B \; G \\
B \; G & \rightarrow X \; G \\
X \; G & \rightarrow X \; A \\
\text{s} \; X & \rightarrow \text{s} \; B \; A \\
A & \rightarrow 0 \\
B & \rightarrow 1 \\
\end{align}

Yes, that's a lot of rules. I may have missed some shortcuts; on the other hand, I may also have missed cases where unwanted strings are generated, which could be avoided only with a more complex grammar. Some explanations:

I'm using $A$ and $B$ as temporary placeholders for $0$ and $1$ because context-sensitive rules do not allow terminals to be rewritten. The last two rules replace them with the terminals they stand for.
The first three rules are pretty obvious.
The fourth starts the count.
The next four turn a count ending in $0$ into a count ending in $1$ and add a $\text{p}$.
The next four turn a count ending in $1$ into a count ending in $X$ (meaning a carry to be propagated) and add a $\text{p}$.
The remaining seven rules (those just before the final two) perform the carry propagation.
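One way to gain some confidence in a grammar like this is to enumerate its derivations by brute force. The sketch below is only that, under stated assumptions: each symbol is encoded as a single character, the second rule is read as $\text{s} \; C \rightarrow \text{s} \; A$ (so that it keeps its left context and no rule shortens the string), and the names `RULES`, `generate` and `expected` are made up for illustration. Because no rule shortens a sentential form, discarding forms longer than `max_len` loses no word of length at most `max_len`.

```python
from collections import deque

# Each rule is (left-hand side, right-hand side), one character per symbol.
RULES = [
    ("S", "sC"), ("sC", "sA"), ("sC", "sBp"), ("sC", "sBCp"),
    ("AC", "AD"), ("AD", "BD"), ("BD", "Bp"), ("BD", "BCp"),
    ("BC", "BE"), ("BE", "XE"), ("XE", "Xp"), ("XE", "XCp"),
    ("AX", "AF"), ("AF", "BF"), ("BF", "BA"),
    ("BX", "BG"), ("BG", "XG"), ("XG", "XA"), ("sX", "sBA"),
    ("A", "0"), ("B", "1"),
]
TERMINALS = set("s01p")


def generate(max_len):
    """Return every terminal string of length <= max_len derivable from S."""
    seen = {"S"}
    queue = deque(["S"])
    words = set()
    while queue:
        form = queue.popleft()
        if set(form) <= TERMINALS:
            words.add(form)  # no nonterminal left: this is a word
            continue
        for lhs, rhs in RULES:
            start = form.find(lhs)
            while start != -1:  # try the rule at every position
                new = form[:start] + rhs + form[start + len(lhs):]
                if len(new) <= max_len and new not in seen:
                    seen.add(new)
                    queue.append(new)
                start = form.find(lhs, start + 1)
    return words


def expected(max_len):
    """Strings 's' + binary(n) + 'p' * n that fit within max_len."""
    out, n = set(), 0
    while True:
        word = "s" + format(n, "b") + "p" * n
        if len(word) > max_len:  # word length grows strictly with n
            return out
        out.add(word)
        n += 1
```

Comparing `generate(k)` with `expected(k)` for a few small `k` checks the grammar only up to that length; it is a sanity check, not a proof.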
