High Integrity Compilation

The book High Integrity Compilation is keeping my brain engaged as I munch my lunchtime sandwiches. The basic plot concerns writing a bug-free compiler, and the cast consists of the semantics of the high-level language being compiled and the semantics of the target assembly language. Happily, it’s really quite easy going. My previous encounters with formal semantics have generally left me feeling out of my depth. In contrast, this book deals with source/target languages which are just complex enough to show interesting properties but simple enough to pick up immediately.

I’ll try to give an example of why this book makes things easier. The “meaning” of a computer language is usually formally defined by a mapping from “programs” to some kind of mathematical object, like a set. This is useful because we can then leverage all of our mathematics knowledge to learn stuff about behaviour of the program. Now, most books on semantics start with a paragraph like “we’d like to use boring old primary-school sets as our mathematical objects, but they’re not powerful enough”. And before you know it, they’ve starting using domain theory and fixed-points as if they were as common as making a cup of tea. Around this point, I usually head over to slashdot and never return. However, the “High Integrity Compilation” book happily doesn’t disappear down this road. It stays safely up in the land of simple maths and easy state machines. The languages under consideration are simple imperative ones, and so there’s not the same focus on recursive functions that you get with ML-like languages and so you’re not immediately hit by a wall of complexity on page four of the book.

This book also dutifully walks through useful examples, working up from simple three-liners into multi-page walkthroughs. There’s a refreshing lack of the words “obviously”, “clearly” and “trivially” and instead plenty of concrete worked examples.

Of course, this means that the resulting compiler isn’t going to really change the world. The source and target languages are chosen for pedagogical reasons rather than real-world usefulness. But this means that, for someone like me learning on their own, you can quickly get a good grounding in semantics without having to deal with a whole load of picky little uninteresting details.

Crashes of one sort or another

In a world where technical subjects are often made needlessly complex through the use of jargon, it’s always refreshing to come across examples of clear thinking and good communication. This time, it’s the serious subject of the report into the causes of the Concorde crash. After the inital horror of the crash itself, there was clearly a need to understand and respond to the causes of the crash (those who ignore history are doomed to repeat it). Apart from being a remarkable insight into the workings of Concorde, what is clear from reading through the report is that it is based around evidence rather than conjecture. Whereas a media report can simply state “it is thought likely that a piece of metal caused a tyre explosion” and be sufficient for its purpose of informing the masses, a formal crash report must stand up to higher levels of rigour. If there was a piece of metal on the runway, there must be a plane out there missing a piece of metal. You need to find that plane and carefully examine it, If you think a piece of metal could cause Concorde’s tyre to explode, you need to get an identical tyre and try running it over an identical bit of metal under realistic conditions to see what can happen. Simply saying “this could have happened” is not enough – you need to demonstrate that the theory is consistent with known events, and you need to carefully put together a jigsaw puzzle, justifying each and every piece.

Probably the most important ingredient for a retrospective investigation like this is the evidence trail, leading backwards into time like the roots of a tree. The design and manufacture of Concorde is well documented, as are the maintenance schedules and details. The operation of the airport on that day is well documented, and even the chance video footage of the accident itself provides information which can be matched up with telemetry and blackbox recordings.

Reading the report reinforced the feeling I had when I read the Ariane 5 report and the Challenger report: We can learn from our mistakes only if we preemptively leave enough crumbs of information along the way.

We’re human, and we screw up pretty regularly. If we accept that we’re going to make mistakes, we can at least plan for this eventuality. By making good choices today, we can minimize the impact and cost of our future mistakes and ensure that we can learn from them.

I don’t design aeroplanes for a living, so it’s time to connect this, as ever, back to the world of software. All our applications contain bugs of one sort or another. Sometimes they crash, sometimes they give wrong answers, sometimes they just lock up and stop responding. They always have done and they always will do. If we accept that this is true, what can we usefully do in response?

Let’s look at crashes first. Crashes are bad primarily because they usually imply some degree of data-loss or state-loss. Perhaps you loose a few pages of the book you were writing. Or perhaps you’d spent ages getting your “preferences” just right, and the app crashed without saving them. Or maybe the application crashes every time you try to print, and you really need to print!

Unlike with aeroplanes, it’s easy to restart the application after a crash. But it’s the data loss which is painful to users. So the First Law of Applications ought to be “never loose a users data, or through inaction allow a user to loose data”. It’s not acceptable to rely on the user to manually save their work every few minutes. It’s not acceptable to trust that the user will quickly do a “Save as..” after starting work on a new document instead of working away for hours in the default Untitled document. It should be the application’s responsibility to ensure that the user’s work is safe in the event of a crash. So, it should be regularly saving the users work, settings and undo/redo queue (and any other state) into a secret file somewhere. In the event of a crash, the application should take care of restarting itself and pick up where it left off.

Continuing with the aviation analogy, a plane doesn’t just have two possible states – “everything fine” and “crashed”. Planes are complex beasts, with engines, hydraulic systems, electrical systems. It is almost certain that something will start to go wrong with some part of the plane eventually. So, the designers also build in fault-detection systems and maintenance schedules. There are systems for detecting unusually high temperatures, excessively low tyre pressures, problems with the electrical system and so on. Furthermore, there are system for checking that those systems are working correctly (if your smoke alarm isn’t working, you don’t have a smoke alarm).

Most applications also contain fault-detection systems. An ASSERT() statement is a little self-check inside a program. If states some property which should always be true (eg. a person’s weight must be a positive value). If the assertion fails, something is badly wrong. A good application will have assertions sprinkled liberally all over the codebase – a little army of sanity-checks, watching out for something going wrong.

However, for some crazy reason, most people only enable these self-checks during inhouse development. When they ship the application, they disable these checks. This might be reasonable behaviour if the software industry had a track record of detecting every single bug before shipping, but that is obvious not the case. Even the best tested application in the whole world will throw up bugs once it’s out in the field, perhaps running on an old Win98 machine, set to Icelandic language, with a slightly unusual mix of recent and out-of-date system DLLs, and a copy of some overactive anti-virus software.

If there is a bug, and you have assertions enabled, you are quite likely to detect the problem early and be able to gather useful information about the application’s current state. With assertions disabled, the same bug will probably allow the application to stumble on for a few more seconds, gradually getting more and more out of control until it crashes in a tangled heap, far away from the original problem area.

Aeroplane designers don’t take out their fault-detection systems once the plane enters service. Neither should application developers. The performance hit is negligable in the face of the increased long-term reliability gains.

Most warning systems in an aeroplane exist because some remedial action can be taken. There’s no “wing has fallen off” warning, because there really isn’t anything you can do to recover from that. What remedial action could a software system take, instead of just stopping when a fault is detected? Actually, if crashing doesn’t loose any user data, then crashing is not too serious. Restarting the app probably gets us back to a known good state. One idea, called microreboots (part of crash-only software) is to reduce the granularity of the “restart”. For example, if a crash occurrs in the GUI part of the app, it may be possible to restart only that section of the program, leaving the underlying data unaffected. Or if a crash occurs on printing, it would be acceptable to abort the printing operation and restart only that chunk of the app.

This is easier in some language than others. In C++, the unregulated access to memory means that the “printing” part of the application can easily scribble over the memory belonging to another subpart of the application. In strongly typed languages (whether dynamic or statically typed) this is not a problem Furthermore, a NULL pointer access in C++ (which is a very common cause of crash) is not recoverable. It does not get turned into an exception, like in Java, and cannot really be dealt with usefully by the application. Given these constraints, it seems to me that a C++ program can only use microboots by splitting the application across multiple processes and using “processes” as the reboot granularity. Furthermore, the application needs to deal gracefully with the fact that subparts may fail and not carry out a request.

Even if you don’t go all the way towards microreboots, you can still use the ideas to flush out bugs from your application. In the classic model/view design pattern, all of the state resides in the model. So, you should be able to pull down the whole view system and restart it whenever you choose. You could, quite possible, do this every minute on a timer. Any bugs caused by the view irresponsibly caching state may well be flushed out by this technique. If you can’t do this in your application, why not?

Let’s revisit the aviation analogy one more time. Recommendation R7 in the Ariane 5 crash report says “Provide more data to the telemetry upon failure of any component, so that recovering equipment will be less essential”. That’s a posh way of saying it’s easier to analyze the blackbox recordings than it is to figure out what happened by piecing together debris.

When an application crashes, the first things the developers want to know is “what were you doing when it crashed?”. If you can reliably remember what you were doing in the leadup to a crash, you’ve got a rare talent. Most people can give a general description of what they were trying to achieve at the time, but only rarely do people remember the exact sequence of events leading up to the crash.

So why rely on people’s memory? The solution is to have the application itself gather the information itself. When an app dies, it actually has plenty of time to collect together a crashdump (or minidump) which details exactly what the program was doing at the point it crashed – what code was running, what the contents of the memory was. It can also explain to the user what has happened, invite them to email the crash report back to the developers, restart the application and resume from where it left off. If a developer receives a crashdump they can usually see immediately what the application was doing at the time it crashed.

Furthermore, the application can have a “blackbox recorder” which keeps track of recent activity. It can record a high-level summary of activity, like “12:10 Opened file “foo.txt”, 12:11 Edit/Select All menu item chosen” etc. If the application subsequently crashes, it can add this summary of recent activity into the crash report. This more-or-less removes the need for users to explain what they were doing.

I like to think of this as proactive debugging. Normally, debugging occurs after-the-fact – something has gone wrong and you need to figure out why. If you adopt the “proactive debugging” mindset, then think “at some point, I’m going to have to track down a bug in this app, so what kind of scaffholding and information would make that task easier?”. And then you add that in ahead of time. As the project develops, and you learn more about what kinds of bugs are common, you can tune your monitoring systems to pick up these problems as early as possible.

I don’t think there’s much chance that developers will stop writing buggy code any time soon, so we may as well concentrate effort on building a better net to catch the bugs once they’re there.

Method & Creativity

During performance work, I find myself alternating between thinking “yup, that all seems reasonable” and “oh gosh, what’s going on there?”. It’s like some kind of crazy jigsaw puzzle.

I find performance work quite relaxing and satisfying. I think there are several reasons for this. Firstly, it is very methodical work. It goes like this:

1. Establish which particular aspects of your application are too slow
2. Establish what would consistute “fast enough”.
3. Create easy-to-run test cases for each area that needs speeding up (eg. application startup time).
4. Run a profiler to see where your application is actually doing during that time.
5. Based on that evidence, make changes to the application.
6. Repeat until things are fast enough.

It is very methodical, but enjoyable too. When you are identifying the performance bottlenecks, you have powerful tools (such as Intel’s Vtune) at your disposal and they do most of the hard work for you. It is always very satifying to use good tools.

These tools produce a large amount of data, and you have to put on your “investigation” hat to intepret the raw data. I enjoy this phase, partly because I know that all the relevant information is available to me, and partly because it lets you see your application from a different angle. It’s like exploring a landscape, building up a map of an area that you only vaguely knew before. I am always very familiar with the static structure of the applications I work on (what each bit of the code does, and how they fit together), but it’s only when I am doing performance work that I look at the big-picture dynamic structure of the application.

During this investigation phase, some of the facts which reveal themselves are expected and familiar – for example, you would expect a game would spend a lot of its time drawing graphics. These points act as signposts to me – known markers in the large mass of information. But then when you continue drilling down, you often hit surprises. Beforehand, you probably had suspicions as to where the time was being spend. I have to say that my suspicions have always been consistently wrong – to the point that I no longer bother thinking about them too much .. I just run the profiler. During performance work, I find myself alternating between thinking “yup, that all seems reasonable” and “oh gosh, what’s going on there?”. It’s like some kind of crazy jigsaw puzzle. Sometimes weird stuff shows up; people accidentally leave bits of code in the application which shouldn’t be there. Mostly however, the code is basically fine, but with your performance-tuning goggles on (and the benefit of hindsight) you can see how things can be restructured to make them more efficient.

Once you’ve explored around for a while (a fine opportunity to drink some tea) you end up with probably a few areas which you think could be improved. Now comes the creative side of performance work, because at this stage you have lots of choices. You could make the program do the same amount of work, but faster. You could make the program avoid having to do so much work. You could delay the work until later, waiting for a time where a delay wouldn’t be so important, and possibly never having to do the work at all! There are other angles to approach this. It is often not the absolute performance of your application which is critical – it is the user’s _perception_ of the performance which matter. This is why an application will display a progress bars during a long operation. This is why we have splash screens during application startup. This is why lift engineers put mirrors in lobbys near to the lifts. People hate having to wait, and they perceive and application which makes them wait as being slow.

So, the “make things appear less slow” stage can involve a huge range of techniques, from low-level assembler optimizations, through design changes, right up to HCI tricks. You have a real opportunity to pull out your toolbox and go to work. But at all stages, you can always go back and test how well you’ve solved the problem. It’s a lot easier than, for example, choosing which new features to add to an application.

Right, that’s nearly the end. This turned out not to be so much about profiling itself, but more about my reaction to the task. I’m not sure why I’ve recently turned my attention away from technology towards my reaction to technology. Partly it is triggered by the title of a famous paper by Dijkstra, “Programming considered a human activity”. Yes, I am still strongly interested in technology and I think there are great gains still to be made by improving our technology. But at the same time, it is we human beings who use the technology, and we who are affected by its results. Software projects are carried out by humans who get bored sometimes, excited sometimes, and are most definitely not robots. A software methodology which treated team members as interchangable automata is doomed to failure, because it would crush the spirit of each team member. But on the flip side, you need some structure in order to harness the team’s energy and coordinate their efforts. I think the best gains are to be made by adopting methods which amplify and harmonize the efforts of individuals, rather than focusing on process and expecting individuals to execute that process. I think a useful first step is to be aware nof your own rational, emotional and physical reactions to the work you do, and try to avoid the nasty “distraction field” which seems to operate around computer. Oh look, another superficially interesting article on slashdot! There goes another few minutes of my life!

Fixing computers / Dynamic linking

Who says you can’t fix things any more? (more photos here).

Sitting at work waiting for our C++ application to link reminds me that just because your language can do lots of stuff at compile time doesn’t mean you necessarily want it to do so. Usually with a C++, the linker goes through your code figuring out what address each function will live at, and making each function-call jump to that address. You do almost all the linking work upfront. In contrast, other languages are lazier – they don’t bother doing any work until it’s actually needed, which makes for faster edit/compile cycles because you’re not doing the work upfront. Actually, often these languages forced into being lazy because the information isn’t available at compile time. But contrapositive doesn’t apply: if you have the information you don’t have to be eager. A C++ compiler could delay figuring out which function to call until the first time the function was called. You’d also need to keep all the type information around until runtime, but you’re probably already doing that for debugging purposes anyway. You’d be moving one phase of the compiler from compile-time to run-time.

You might think it would all be much easier to do this stuff by starting with a more dynamic language. but then with “dynamic-C++” you still maintain the benefit of having static typing checking for silly mistakes. Which brings me full circle to a paper which was on LtU recently which discusses this very topic (and which I have yet to read!).

Ah, dynamic languages vs. static languages. Here I am again.

Webcams

I got a Creative Labs NX Pro webcam, but unfortunately there aren’t linux drivers for it yet. I did find a project which is developing a driver for zc030x-based webcams. They’ve managed to grab a still image from a webcam, so that’s hopeful. However, the driver only builds under linux 2.4 at the moment and I’m using 2.6.

First of all, I did a bit of reading about the kernel usb support.

  • There’s a skeleton usb driver (in drivers/usb/usb-skeleton.c). Under Documentation/DocBook there’s an article about the USB system.
  • drivers/usb/media/dabusb.c is quite a clean looking driver
  • USB programmers guide
  • Loads of great articles about porting drivers to the 2.6 kernel

I read through the existing code for the zc030x driver, and used Freemind to organise my notes about it.

Since then, I’ve crashed my computer more times that I can remember. Writing kernel code is very unforgiving. If you make a mistake, the whole computer will probably hang. I’ve ended up adopting a very different, even-more-paranoid-than-usual coding style, and have aliased insmod/rmmod to automatically run sync beforehand! But I’ve been making progress – the driver compiles, loads, creates the /dev file correctly, and responds to incoming data .. but I don’t think the webcam is being initialised properly yet. I need to do more snooping of the datastream under Windows to understand what needs to be sent when.

USB itself has been really nicely designed. The designers have obviously considered real use-cases and designed for these. For example, there are four different modes of data transfer, including isochronous (regular time interval, eg. for a webcam running at 15fps) and bulk transfer (eg. for a usb disk, which wants to use as much bandwidth as is available). When there are multiple devices plugged in to your PC, the USB system itself is responsible for arbitrating who is allowed to use how much bandwidth. The usb system knows that it’s better to drop a webcam frame entirely than to deliver it really late. On the other hand, it knows that it should stop files being copied to my usb storage stick if the webcam wants to send a frame. All of this happens behind the scenes, as far as a programmer is concerned. USB is not just an end-to-end pipe, like rs232 was. There’s a whole bandwidth/device management layer on top of that.

Oh, finally .. learning a new code always takes a bit of effort, but it can be made easier by using decent tools. Xref is a refactoring/navigation tool for C. If you’re working on a C program with emacs, and you’re not using this already then you must enjoy pain. Note that you don’t want to be browsing the kernel headers in /usr/include – you need to use the ones under /lib/modules/version/build. Here’s the options for xref to set this up.

[webcam]
  //  input files and directories (processed recursively)
  ~/Projects/zc0302/
  -D__KERNEL__
  -resetIncludeDirs
  -I /lib/modules/2.6.5-gentoo/build/include/linux
  -I /lib/modules/2.6.5-gentoo/build/include
  -I /lib/modules/2.6.5-gentoo/build/
  //  directory where tag files are stored
  -refs ~/Xrefs/webcam
  //  number of tag files
  -refnum=10