Parsing binaries, applicative stylee

I’ve been writing a parser for a random file format. As usual, I spent more time playing around with simple functions than anything else. But along the way I learned more about Control.Applicative, and in particular how to use it to write super-concise Data.Binary parsers.

My file format starts with a header containing some “magic” bytes, then a data version (1 byte) and an app version (3 bytes), separated by NULLs. The parsed header will have type Header, as follows:

data Header = Header { dataVersion :: Word8, appVersion :: (Word8,Word8,Word8) } deriving Show

magic = (0x5 :: Word8, 0x6 :: Word8, 0x7 :: Word8)

I’m using Data.Binary, and the parser is just this …

getHeader :: Get Header
getHeader = do
  fail "Magic word bad" `unlessM` ((== magic) <$> get)
  Header <$> (skip 1 *> get <* skip 1) <*> (get <* skip 3)
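
Aside: for anyone wanting to run this snippet, here’s roughly the set of imports it assumes, plus the unlessM helper defined below. I’m also assuming a version of the binary package whose Get type has an Applicative instance (recent versions do).

import Data.Binary          (get)
import Data.Binary.Get      (Get, skip)
import Data.Word            (Word8)
import Control.Applicative  ((<$>), (<*>), (<*), (*>))
import Control.Monad        (unless)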

So there’s a lot of cool stuff going on here!

  1. To parse the “magic” bytes, I really wanted Perl’s “die unless ok” idiom. I couldn’t find anything in the standard library, so I wrote my own “unlessM”. Used with backticks, we end up with code which looks like “die_action `unlessM` everything_is_fine”. The implementation, pointlessly concise, is just:

    unlessM :: Monad m => m () -> m Bool -> m ()
    unlessM m b = b >>= (`unless` m)
  2. The ‘get’ action on the first line yields a triple of Word8’s (thank you, type inference). The “(== magic)” section compares it with the expected value. To apply the function to the result of the action, we could use liftM, but Control.Applicative provides the nicer <$> – reminiscent of normal function application using ($) but, here, in a monadic setting. By the time we’ve plugged those two together, we get a Bool-yielding action .. exactly what unlessM wants.
  3. After parsing the “magic” bytes, we need to skip a byte, grab the data-version byte, skip a byte, grab the three app-version bytes, and skip a final 3 bytes .. and then return a Header value. All of this is done in a single line:

    Header <$> (skip 1 *> get <* skip 1) <*> (get <* skip 3)

    Reading from left to right, this says:

    • We’re going to return a value built using the Header constructor, a function which takes 2 arguments.
    • Header’s first argument comes from a skip-get-skip action. More on this next, but note that <$> is just function-applied-to-result-of-computation which we’ve seen already in line 1.
    • To grok the skip-get-skip action, first imagine that the *-operators mean simply “and then..”, ie. the three actions are sequenced. In practice, they have another trick up their sleeve. What happens to the values yielded by each of the three actions? That’s where the arrows come in. They point to the action whose value is “the chosen one”. So, “a <* b” is a composite action where “a” runs first, followed by “b”, with a’s value being yielded and b’s ignored. This works quite nicely in this example. Why is this any different to monadic bind? Sure, each sub-parser is independent of its mates – the ‘get’ parser has no interest in what the ‘skip’ parser returned – but >> would do that. The important difference is that the values yielded by the parser are only used once – right at the end. In this example, we don’t vary the parsing based on anything we read – although, in general, we might well want to do that. It’s horses for courses; if you want to combine actions but only need to hyperspace one value out, <* and *> are your friends. If you need to combine actions in more complicated ways, >>= is there waiting for you. (For comparison, there’s a fully desugared monadic version of this parser just after this list.)
    • So far, we’ve glued together the Header function and (the action which produces) its first argument. Next, we need to bake /that/ together with the action for the second argument. That’s what the <*> operator from Control.Applicative gives you. Again, the intuition is that * means sequencing, but this time both the left and right arguments are used to determine the value yielded; specifically, via function application.
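
For comparison, here’s the same parser hand-desugared into plain monadic style. It’s only a sketch, but it should do the same job as the one-liner above:

getHeaderMonadic :: Get Header
getHeaderMonadic = do
  m <- get
  unless (m == magic) (fail "Magic word bad")
  skip 1
  dv <- get                 -- the single data-version byte
  skip 1
  av <- get                 -- the (Word8, Word8, Word8) app version
  skip 3
  return (Header dv av)

Note how the intermediate names (m, dv, av) exist only to be handed over at the very end – which is exactly the pattern that <$>, <*>, <* and *> factor out.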

But now I ask myself: is this readable code, or is it line noise?

Certainly, you have to understand what the four applicative operators do .. and ‘applicative’ sounds scary. However, those operators capture two patterns which occur all the time in parsers. Firstly, a parser yields some final result, based on the data it saw whilst running its component parsers in order (ie. <$> and <*>). Secondly, each part of the returned value corresponds to some subset of the component parsers, although we don’t always use all of the bits we parsed (<* and *>). A parser for a real-world file format is going to do these things over and over, so it makes sense to factor the patterns out and give them a (fairly) evocative syntax.

Secondly, type inference is doing a lot of work here. Mostly, I just need to call ‘get’ and the compiler can infer whether I’m expecting to parse a byte or three bytes. It’s all driven off the top-level Header type itself. I do worry that this introduces a very strong coupling between the data definitions and the behaviour of the parser. Let’s say I want to elide the third field of the appVersion. If I just change appVersion to be a 2-tuple instead of a 3-tuple, then the parser will still build and run – except it won’t parse the right file format anymore! I need to remember to go in and add a “skip 1” to the parser. Still, type inference is a double-edged sword, and I can remedy this simply by adding explicit type signatures in appropriate places.
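
For example, pinning down the two ‘get’s with inline signatures makes the on-disk format explicit; something like this sketch:

getHeaderExplicit :: Get Header
getHeaderExplicit =
  Header <$> (skip 1 *> (get :: Get Word8) <* skip 1)
         <*> ((get :: Get (Word8, Word8, Word8)) <* skip 3)

Now a change to the Header type produces a type error here, rather than a parser which silently reads the wrong format.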

Ho hum, now I need to stop procrastinating and actually parse the files …

Model Checking

Model checking has been on my list of “stuff to understand” for a long time. It’s one of the few ‘formal methods’ which get noticeably positive press. This month’s Communications of the ACM had a big article about it, of which I understood very little. However, it prompted me to search t’internet, where I found this introduction to model checking which, shockingly, has an actual worked (albeit trivial) example. Now I am enlightened!

Talks @ University

This morning, I did a guest lecture for the Advances in Programming Languages course at Edinburgh University. I’ve talked at the University before, as part of the SPLS seminars. So, when I met Ian Stark again at ICFP and he invited me to talk to his class, I jumped at the chance.

Partly, I just love doing talks. But, specifically, I thought it would be a good opportunity to share some of the stuff I’ve learned over the last 12 years since I left uni. I’ve interviewed lots of graduates over the years, and I’ve seen how they adapt to working in industry. I’ve seen grads grow into world class software developers, and I’ve observed some of the common traits and interests that the best of them have.

And so my talk split into three themes. Firstly, I explained why I think learning languages is a worthwhile use of one’s precious/finite time. Secondly, I explored which languages give you the biggest bang-for-buck in terms of expanding your world view. And, finally, I used Erlang as a concrete example, both to explore the design pressures that influence a language and to ‘cherry-pick’ the key ideas from it.

My aim was not to convince anyone to use Erlang day-to-day. My aim was to demonstrate that you can expand your mind by seeking out new ideas. Much like riding a bicycle, once you’ve seen a new way to look at the world you never forget it. And Erlang, operating in the difficult realm of high availability software, was full of good examples.

It’s pretty cool to be asked back to the University you graduated from, and to have the chance to share your thoughts with the next generation of students. Thanks to Ian Stark and David Aspinall for inviting me back to the Uni.

Royal Institution Christmas Lectures, take one

When I visited the Royal Institution in London recently, there was obviously something ‘going on’ in the building. After checking out the official exhibit in the basement, I explored the building by following the staircase up past an interesting succession of portraits. I could hear a professional sounding talk coming from somewhere, and saw several stressed stagehands running from a room packed with scientific props to a door leading to a backstage area.

This, then, would be the world-famous Royal Institution Christmas Lectures! They’re now being shown on TV as I speak. I didn’t want to get in the way of the stagehands and their precious cargo, and so I beat a hasty retreat back downstairs.

However, I noticed today that the RI website says that the Christmas Lectures are held at 6pm. But I visited at around 3pm. So how come I managed to overhear a lecture?

Turns out, these slick TV productions don’t “just happen”. I found an article by a previous Christmas lecturer which explains the painful reality. Each lecture is preceded by at least a gruelling day and a half of rehearsals and planning. It’s more like a stage show than a simple lecture – the cameras, sound guys, lighting and stagehands all need to figure out what they’ll be doing and when they’ll be doing it. And the lecturer needs to figure out where to look, who to talk to *and* remember their words! Seems like working with children & animals is the least of their troubles.

Having watched the first lecture on TV now, I’m left wondering how they got two donkeys up to the lecture theatre. Did they walk up the stairs? Do they have a lift – and, if so, would you get into a lift with two donkeys?!

And so I leave the Royal Institution, home of Michael Faraday and Humphry Davy, by contemplating the deepest question of science: Donkeys and staircases. Staircases and donkeys.

Technical debt (or, mortgages in Haskell)

I recently got fed up trying to understand my mortgage using Excel. After twenty minutes guddling with cells and individual values, I felt the need to create higher-level abstractions such as “mortgage” and “payment strategy”. I also wanted to create a list of possible repayment strategies and easily compare them to see how each affects the loan duration and total interest paid. This is possible in Excel, but no fun.

So, fast-forward to the end of an evening’s hacking with Haskell. I now have hmortgage, an EDSL for expressing payment strategies and code which will expand out a mortgage into monthly steps, like this:

We are looking at loan of £1000.00 at 5.0% over 10y, which has required monthly payment of £10.58
        Total interest: £272.97 Total payments: £1272.97 Duration=10y 1m
Overpayment scenario "2 pm, 200 initial":
        Total interest: £132.09 Total payments: £1132.09 Duration=6y 3m
        Compared to baseline: interest=£-140.88, payments=£-140.88, duration=-3y 10m
For month 1, balance: £1000.00 -> £791.58       (interest: £4.16, payment: £212.58)
For month 2, balance: £791.58 -> £782.29        (interest: £3.29, payment: £12.58)
For month 3, balance: £782.29 -> £772.96        (interest: £3.25, payment: £12.58)
For month 4, balance: £772.96 -> £763.60        (interest: £3.22, payment: £12.58)

ie. if you overpay by £2 each month, and pay an initial lump sum of £200, you’ll save about £140 overall and will repay the mortgage nearly 4 years early.

There are a few points of haskelly interest in this code, mostly inspired by stuff I read a few years ago – behaviors in FRP, and SPJ’s “composing contracts” paper.

Combinators for payment strategies

I have a few primitive payment strategies, which can be combined into more complex strategies:

  • monthlyPaymentsOf (100 Pounds)
  • lumpSumOf (100 Pounds)
  • lumpSumOf (100 Pounds) `after` (1 Year)
  • monthlyPaymentsOf (100 Pounds) +. (lumpSumOf (100 Pounds) `after` (1 Year))
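
To make that concrete, here’s a rough sketch of how such combinators could hang together, using the month-number-to-payment representation described in the next section. The names match the bullets above, but Period and the bodies are my guesses rather than hmortgage’s actual code:

newtype Currency = C Integer               -- a number of pence
newtype Period   = Months Integer          -- assuming e.g. (1 Year) ends up as Months 12

type Schedule = Integer -> Currency        -- month number -> payment due that month

monthlyPaymentsOf :: Currency -> Schedule
monthlyPaymentsOf amount = \_month -> amount

lumpSumOf :: Currency -> Schedule
lumpSumOf amount = \month -> if month == 1 then amount else C 0

after :: Schedule -> Period -> Schedule
after schedule (Months n) = \month -> schedule (month - n)   -- shift the schedule n months later

(+.) :: Schedule -> Schedule -> Schedule
(s1 +. s2) month = C (p1 + p2)
  where C p1 = s1 month
        C p2 = s2 month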

Shallow embedding of DSL

The DSL is a shallow embedding; it represents the monthly payment plan as a function from month-number to the payment amount, ie. Integer -> Currency. There’s a problem with this approach – the only thing you can do with a function is apply it to some arguments. This is fine for finding the payment for a particular month, but I would also like to derive a textual description of the payment plan – which isn’t possible with functions.

From stuff I’ve read previously, I think my two options are:

  1. Lisp-like: Represent the payment schedule as data (ie. like an AST) and provide an eval function. This allows introspection into the structure of the payment schedule. Code is data, data is code.
  2. Arrow-like: The payment strategy could be a tuple of the function and a textual description. When strategies are combined, the combinator would merge the textual descriptions as well as producing new combined functions (a rough sketch of this follows below). I’m not totally convinced that the English language is ‘compositional’ in this way though – it might end up with really clumsy phrasing.
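
Just to make option 2 concrete, here’s a rough sketch (the names and types are mine, not the real code): pair the payment function with its description, and have each combinator merge both halves:

newtype Currency = C Integer               -- pence, as before

data Strategy = Strategy
  { paymentFor  :: Integer -> Currency     -- month number -> payment due
  , description :: String }

plus :: Strategy -> Strategy -> Strategy
plus a b = Strategy
  { paymentFor  = \m -> addC (paymentFor a m) (paymentFor b m)
  , description = description a ++ ", plus " ++ description b }
  where addC (C x) (C y) = C (x + y)

Whether “monthly payments of £100, plus a lump sum of £200 after 1 year” always comes out reading sensibly is exactly the clumsy-phrasing worry above.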

Crazy Lennart-inspired postfix operators

Initially, the only way I had to create a ‘Currency’ value was via the ‘pounds’ function. In Haskell, the function precedes the argument, hence it looks like “pounds 20”. The source code would read nicer if I could write this as “20 pounds”, like we do in English. I didn’t think this was possible in Haskell.

Then I remembered seeing Lennart Augustsson’s crazy embedding of BASIC into Haskell. In particular, he had code which looked like this:

runBasic $ do
  20 END

How the heck does that parse? It’s using ‘do’ notation, so “20 END” must have a type in the Monad class. But, as I understood things, “x y” means “apply the (function) value x to value y”. And “20” doesn’t look much like a function to me.

Digging into the source, I found this:

-- 10 END
instance Num (END -> Expr a) where
    fromInteger i c = ...

Hmm, interesting. This is saying that (some) function type can be treated as if it is “number like” and provides a mechanism for converting integer literals in source code to that type. I hadn’t fully appreciated this, but the Haskell Report says that numeric literals aren’t quite as literal as I expected – the literal integer value gets passed through ‘fromInteger’ and can therefore become a value of any type in the Num class.

So this code really says “Hey GHC, if you come across a numeric literal in the source code, you can turn it into a function if you need to”. In the BASIC example, the next thing on line 20 is “END”, a constructor for the type also called END. So GHC will be looking to turn the “20” into something that can be used as a function taking an argument of type END, and so it’ll call this instance of fromInteger.

Hurrah, I can use the same ‘trick’ to make my currencies look nicer:

-- Currency is just a wrapper around a number of pence.
newtype Currency = C Integer

data MONEY = Pounds | Pence

-- Needs {-# LANGUAGE FlexibleInstances #-}; fromInteger is the only method defined.
instance Num (MONEY -> Currency) where
  fromInteger i Pounds = C (i * 100)
  fromInteger i Pence  = C i

Now I can say “42 Pounds” or “23 Pence”. The “42” will become a function with type MONEY -> Currency. The “MONEY” type is really just a tag: its appearance in the function type is what selects this Num instance, and the Pounds/Pence value then picks which branch of fromInteger runs, constructing a Currency value (represented as a number of pence, using a simple wrapper constructor called C).
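
For instance, with the definitions above, a literal followed by a tag is just fromInteger being applied to that tag (the name ‘deposit’ here is only for illustration):

deposit :: Currency
deposit = 200 Pounds      -- elaborates to: fromInteger 200 Pounds, ie. C 20000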

Is this better, or just “clever”? I’m not sure yet. It’s certainly easier to read. But I feel I’ve taken a step away from “pure Haskell” into a slightly weird world. Still, if I were writing in Lisp, I wouldn’t think twice about doing this kind of thing.

The actual app

Shocker, I’ve produced an app which is actually useful to me in “teh real world”. I have a big TODO list of stuff which will fit nicely into the app – time-varying interest rates, inflation predictions and NPV calculations. None of which, of course, I will ever actually get around to adding. But it’s still useful in its present state, so a win!

Here’s what the “summary” view says – it omits the monthly breakdown and instead reports the overall savings possible via the different payment strategies:

We are looking at loan of £1000.00 at 5.0% over 10y, which has required monthly payment of £10.58
        Total interest: £272.97 Total payments: £1272.97 Duration=10y 1m
Overpayment scenario "2 pm, 200 initial":
        Total interest: £132.09 Total payments: £1132.09 Duration=6y 3m
        Compared to baseline: interest=£-140.88, payments=£-140.88, duration=-3y 10m
Overpayment scenario "2 pm only":
        Total interest: £216.50 Total payments: £1216.50 Duration=8y 1m
        Compared to baseline: interest=£-56.47, payments=£-56.47, duration=-2y
Overpayment scenario "200 initial":
        Total interest: £163.52 Total payments: £1163.52 Duration=7y 8m
        Compared to baseline: interest=£-109.45, payments=£-109.45, duration=-2y 5m
Overpayment scenario "400 initial":
        Total interest: £87.73 Total payments: £1087.73 Duration=5y 6m
        Compared to baseline: interest=£-185.24, payments=£-185.24, duration=-4y 7m
Overpayment scenario "200 after 2y":
        Total interest: £191.42 Total payments: £1191.42 Duration=7y 10m
        Compared to baseline: interest=£-81.55, payments=£-81.55, duration=-2y 3m
Overpayment scenario "400 after 2y":
        Total interest: £137.90 Total payments: £1137.90 Duration=5y 10m
        Compared to baseline: interest=£-135.07, payments=£-135.07, duration=-4y 3m

Eep, it’s 01:30 .. how did that happen? Stoopid jetlag …