Categories
General

Parsing binaries, applicative stylee

I’ve been writing a parser for a random file format. As usual, I spent more time playing around with simple functions than anything else. But along the way I learned more about Control.Applicative, and in particular how to write use it to write super-concise Data.Binary parsers.

My file format starts with header containing some “magic” bytes, then an app version (1 byte) and a data version (3 bytes) separated by NULLs. The parsed header will have type Header, as follows:

data Header = Header { dataVersion :: Word8, appVersion :: (Word8,Word8,Word8) } deriving Show

magic = (0x5::Word8, 0x6::Word8,0x7 :: Word8)

I’m using Data.Binary, and the parser is just this …

getHeader :: Get Header
getHeader = do 
  fail "Magic word bad" `unlessM` ((== magic) <$> get )
  Header <$> (skip 1 *> get <* skip 1) <*> (get <* skip 3)

So there’s a lot of cool stuff going on here!

  1. To parse the “magic” bytes, I really wanted perl’s “die unless ok” idiom. I couldn’t find anything in the standard library, so I wrote my own “unlessM”. Used with backticks, we end up with code which looks like “die_action `unlessM` everything_is_fine”. The implementation, pointlessly concise is just:

    unlessM :: Monad m => m () -> m Bool -> m ()
    unlessM m b = b >>= (`unless` m)
    
  2. The ‘get’ action on the first line yields a triple of Word8’s (thank you type inference). The “(== magic)” section compares it with the expected value. To apply the function to the result of the action, we could use liftM however Control.Applicative provides the nicer <$> – reminiscent of normal function application using ($) but, here, in a monadic setting. By the time we’ve plugged those two together, we get a Bool-yielding action .. exactly what unlessM wants.
  3. After parsing the “magic bytes”, we need to skip a byte, grab the app version byte, skip a byte, grab the three data-version bytes, and skip a final 3 bytes .. and then return a Header value. All of this is done in a single line:

    Header <$> (skip 1 *> get <* skip 1) <*> (get <* skip 3)
    

    Reading from left to right, this says:

    • We’re going to return a value built using the Header constructor, a function which takes 2 arguments.
    • Header’s first argument comes from a skip-get-skip action. More on this next, but note that <$> is just function-applied-to-result-of-computation which we’ve seen already in line 1.
    • To grok the skip-get-skip action, first imagine that the *-operators mean simply “and then..”, ie. the three actions are sequenced. In practise, they have another trick up their sleeve. What happens to the values yielded by each of the three actions? That’s where the arrows come in. They point to the action whose value is “the chosen one”. So, “a <* b” is a composite action where “a” runs first, followed by “b”, with a’s value being yielded and b’s is ignored. This works quite nicely in this example. Why is this any different to monadic bind? Sure, each sub-parser is independent of it’s mates – the ‘get’ parser has no interest in what the ‘skip’ parser returned – but >> would do that. The important difference is that the values yielded by the parser are only used once – right at the end. In this example, we don’t vary the parsing based on anything we read – although, in general, we might well want to do that. It’s horses for courses; if you want to combine actions but only need to hyperspace one value out, <* and *> are your friends. If you need to combine actions in more complicated ways, >>= is there waiting for you.
    • So far, we’ve glued together the Header function and (the action which produces) its first argument so far. So we need to bake /that/ together with the action for the second argument. That’s what the <*> operator from Control.Applicative gives you. Again, the intuition is that * means sequencing, but this time both the left and right arguments are used to determine the value yielded; specifically via function application.

But now I ask myself: is this readable code, or is it line noise?

Certainly, you have to understand what the four applicative operators do .. and ‘applicative’ sounds scary. However, those operators capture two patterns which occur all the time in parsers. Firstly, a parser yields some final result, based on the data it saw whilst running its component parsers in order (ie. <$> and <*>). Secondly, each part of the returned value corresponds to some subset of the component parsers, although we don’t always use all of the bits we parsed (<* and *>). A parser for a real world file format is going to do these things over and over, so it makes sense to factor the patterns out and give it a (fairly) evocative syntax.

Secondly, type inference is doing a lot of work here. Mostly, I just need to call ‘get’ and the compiler can infer whether I’m expecting to parse a byte or three bytes. It’s all driven off the top level Header type itself. I do worry that this introduces a very strong coupling between the data definitions and the behaviour of the parser. Let’s say I want to elide the third field of the appVersion. If I just change appVersion to be a 2-tuple instead of a 3-tuple, then the parser will still build and run – except it won’t parse the right file format anymore! I need to remember myself to go in and add a “skip 1” into the parser. Sure, type inference is a double edged sword and I can remedy this simply by adding explicit type signatures in appropriate places.

Ho hum, now I need to stop procrastinating and actually parse the files …