Crashes of one sort or another

In a world where technical subjects are often made needlessly complex through the use of jargon, it’s always refreshing to come across examples of clear thinking and good communication. This time, it’s the serious subject of the report into the causes of the Concorde crash. After the inital horror of the crash itself, there was clearly a need to understand and respond to the causes of the crash (those who ignore history are doomed to repeat it). Apart from being a remarkable insight into the workings of Concorde, what is clear from reading through the report is that it is based around evidence rather than conjecture. Whereas a media report can simply state “it is thought likely that a piece of metal caused a tyre explosion” and be sufficient for its purpose of informing the masses, a formal crash report must stand up to higher levels of rigour. If there was a piece of metal on the runway, there must be a plane out there missing a piece of metal. You need to find that plane and carefully examine it, If you think a piece of metal could cause Concorde’s tyre to explode, you need to get an identical tyre and try running it over an identical bit of metal under realistic conditions to see what can happen. Simply saying “this could have happened” is not enough – you need to demonstrate that the theory is consistent with known events, and you need to carefully put together a jigsaw puzzle, justifying each and every piece.

Probably the most important ingredient for a retrospective investigation like this is the evidence trail, leading backwards into time like the roots of a tree. The design and manufacture of Concorde is well documented, as are the maintenance schedules and details. The operation of the airport on that day is well documented, and even the chance video footage of the accident itself provides information which can be matched up with telemetry and blackbox recordings.

Reading the report reinforced the feeling I had when I read the Ariane 5 report and the Challenger report: We can learn from our mistakes only if we preemptively leave enough crumbs of information along the way.

We’re human, and we screw up pretty regularly. If we accept that we’re going to make mistakes, we can at least plan for this eventuality. By making good choices today, we can minimize the impact and cost of our future mistakes and ensure that we can learn from them.

I don’t design aeroplanes for a living, so it’s time to connect this, as ever, back to the world of software. All our applications contain bugs of one sort or another. Sometimes they crash, sometimes they give wrong answers, sometimes they just lock up and stop responding. They always have done and they always will do. If we accept that this is true, what can we usefully do in response?

Let’s look at crashes first. Crashes are bad primarily because they usually imply some degree of data-loss or state-loss. Perhaps you loose a few pages of the book you were writing. Or perhaps you’d spent ages getting your “preferences” just right, and the app crashed without saving them. Or maybe the application crashes every time you try to print, and you really need to print!

Unlike with aeroplanes, it’s easy to restart the application after a crash. But it’s the data loss which is painful to users. So the First Law of Applications ought to be “never loose a users data, or through inaction allow a user to loose data”. It’s not acceptable to rely on the user to manually save their work every few minutes. It’s not acceptable to trust that the user will quickly do a “Save as..” after starting work on a new document instead of working away for hours in the default Untitled document. It should be the application’s responsibility to ensure that the user’s work is safe in the event of a crash. So, it should be regularly saving the users work, settings and undo/redo queue (and any other state) into a secret file somewhere. In the event of a crash, the application should take care of restarting itself and pick up where it left off.

Continuing with the aviation analogy, a plane doesn’t just have two possible states – “everything fine” and “crashed”. Planes are complex beasts, with engines, hydraulic systems, electrical systems. It is almost certain that something will start to go wrong with some part of the plane eventually. So, the designers also build in fault-detection systems and maintenance schedules. There are systems for detecting unusually high temperatures, excessively low tyre pressures, problems with the electrical system and so on. Furthermore, there are system for checking that those systems are working correctly (if your smoke alarm isn’t working, you don’t have a smoke alarm).

Most applications also contain fault-detection systems. An ASSERT() statement is a little self-check inside a program. If states some property which should always be true (eg. a person’s weight must be a positive value). If the assertion fails, something is badly wrong. A good application will have assertions sprinkled liberally all over the codebase – a little army of sanity-checks, watching out for something going wrong.

However, for some crazy reason, most people only enable these self-checks during inhouse development. When they ship the application, they disable these checks. This might be reasonable behaviour if the software industry had a track record of detecting every single bug before shipping, but that is obvious not the case. Even the best tested application in the whole world will throw up bugs once it’s out in the field, perhaps running on an old Win98 machine, set to Icelandic language, with a slightly unusual mix of recent and out-of-date system DLLs, and a copy of some overactive anti-virus software.

If there is a bug, and you have assertions enabled, you are quite likely to detect the problem early and be able to gather useful information about the application’s current state. With assertions disabled, the same bug will probably allow the application to stumble on for a few more seconds, gradually getting more and more out of control until it crashes in a tangled heap, far away from the original problem area.

Aeroplane designers don’t take out their fault-detection systems once the plane enters service. Neither should application developers. The performance hit is negligable in the face of the increased long-term reliability gains.

Most warning systems in an aeroplane exist because some remedial action can be taken. There’s no “wing has fallen off” warning, because there really isn’t anything you can do to recover from that. What remedial action could a software system take, instead of just stopping when a fault is detected? Actually, if crashing doesn’t loose any user data, then crashing is not too serious. Restarting the app probably gets us back to a known good state. One idea, called microreboots (part of crash-only software) is to reduce the granularity of the “restart”. For example, if a crash occurrs in the GUI part of the app, it may be possible to restart only that section of the program, leaving the underlying data unaffected. Or if a crash occurs on printing, it would be acceptable to abort the printing operation and restart only that chunk of the app.

This is easier in some language than others. In C++, the unregulated access to memory means that the “printing” part of the application can easily scribble over the memory belonging to another subpart of the application. In strongly typed languages (whether dynamic or statically typed) this is not a problem Furthermore, a NULL pointer access in C++ (which is a very common cause of crash) is not recoverable. It does not get turned into an exception, like in Java, and cannot really be dealt with usefully by the application. Given these constraints, it seems to me that a C++ program can only use microboots by splitting the application across multiple processes and using “processes” as the reboot granularity. Furthermore, the application needs to deal gracefully with the fact that subparts may fail and not carry out a request.

Even if you don’t go all the way towards microreboots, you can still use the ideas to flush out bugs from your application. In the classic model/view design pattern, all of the state resides in the model. So, you should be able to pull down the whole view system and restart it whenever you choose. You could, quite possible, do this every minute on a timer. Any bugs caused by the view irresponsibly caching state may well be flushed out by this technique. If you can’t do this in your application, why not?

Let’s revisit the aviation analogy one more time. Recommendation R7 in the Ariane 5 crash report says “Provide more data to the telemetry upon failure of any component, so that recovering equipment will be less essential”. That’s a posh way of saying it’s easier to analyze the blackbox recordings than it is to figure out what happened by piecing together debris.

When an application crashes, the first things the developers want to know is “what were you doing when it crashed?”. If you can reliably remember what you were doing in the leadup to a crash, you’ve got a rare talent. Most people can give a general description of what they were trying to achieve at the time, but only rarely do people remember the exact sequence of events leading up to the crash.

So why rely on people’s memory? The solution is to have the application itself gather the information itself. When an app dies, it actually has plenty of time to collect together a crashdump (or minidump) which details exactly what the program was doing at the point it crashed – what code was running, what the contents of the memory was. It can also explain to the user what has happened, invite them to email the crash report back to the developers, restart the application and resume from where it left off. If a developer receives a crashdump they can usually see immediately what the application was doing at the time it crashed.

Furthermore, the application can have a “blackbox recorder” which keeps track of recent activity. It can record a high-level summary of activity, like “12:10 Opened file “foo.txt”, 12:11 Edit/Select All menu item chosen” etc. If the application subsequently crashes, it can add this summary of recent activity into the crash report. This more-or-less removes the need for users to explain what they were doing.

I like to think of this as proactive debugging. Normally, debugging occurs after-the-fact – something has gone wrong and you need to figure out why. If you adopt the “proactive debugging” mindset, then think “at some point, I’m going to have to track down a bug in this app, so what kind of scaffholding and information would make that task easier?”. And then you add that in ahead of time. As the project develops, and you learn more about what kinds of bugs are common, you can tune your monitoring systems to pick up these problems as early as possible.

I don’t think there’s much chance that developers will stop writing buggy code any time soon, so we may as well concentrate effort on building a better net to catch the bugs once they’re there.

Cuckoo pointers

A cuckoo pointer is a particularly nasty kind of bug to track down in a C/C++ program. I’ve only encountered two of them in the eleven or so years I’ve been using these languages, but when they occur it can be very difficult to understand what is going on. A cuckoo pointer starts life as an old fashioned dangling pointer – a reference to a object on the heap which has been deallocated. It becomes a cuckoo pointer if you are unlucky enough to allocate an object of the same type, at the same address as the old one. Got that? You’ve got some buggy code which hasn’t realised that the old object is now dead, but now suddenly down plops a new object right in its place.

Normally dangling pointers aren’t too tricky to track down. If you write to memory using one, you’ll almost certainly spectacularly trash a bit of memory and your program will die. But in the cuckoo pointer scenario, the write to memory will probably kinda make sense. If the heap object was a string, the cuckoo pointer will be doing string-like operations to the memory, and your program will probably continue running for a while.

You might argue that the chances of a new object of the same type being allocated at exactly the same address are pretty slim. That’d be true, up to a point, but if your heap implementation is using a block-allocator strategy, the chances shoot right up. A block allocator means that allocations of blocks of 8 bytes will come from a special pool of contiguous 8-byte blocks. There’s a seperate pool for each size of blocks – 2, 4, 8, 16, 32 – up to some maximum size. This is a fast way of allocating memory, because the regular structure means that only minimal bookkeeping is required. But under this scheme, the chances of an new object landing at the same address as a recently-freed object of the same type are much increased.

Why are cuckoo pointers so hard to track down? It is because your application will continue executing for a while after the initial error (the dangling pointer). The application will potentially crash a good while after that error, although at least the crash is likely to be in an object of the same type. It’s not too unlike a data race, where two threads are writing to the same bit of memory. Here you have two objects – one real, one not – existing in the same bit of memory.

How do you track down cuckoo pointers? Well, the first step is to realise that they exist! Then you can switch on the heap’s “do not reuse free’d memory blocks” option and see if that changes the behaviour of the program at all.

Words

I like discovering the story behind words. A few months, I realised that “mer” was greek for “parts” and so “polymer” and “monomer” just meant “many parts” and “one part”. Words which come from other languages, and any “native” words, must themselves have ultimately just been made up by someone at some point. The problem is that the birth of most words is not recorded. You might be able to guess why the word was chosen (like, “anteater”) but I think it’s unusual to be able to pinpoint exactly when a word first entered the language.

I’ve been reading a recent biography of Michael Faraday, who did a lot of important early work in electromagentism, and from this book I read the following historic nugget. By 1831, a lot had been discovered about the nature of electromagnetism, but the language used to describe the phenomena hadn’t caught up. People were often using analogies to water – they talked of “electrical fluid – but this analogy could be confusing. Faraday started a correspondance with Revd William Whewell at Cambridge Uni, which I’ll paraphrase:

FARADAY: “I think we need some names for the terminals of a battery. I’ve came up with: exode, zetode, zetexode but I also think westode and eastode are pretty catchy too”

WHEWELL: “Hmm, that’s a bit of a mouthful. How about anode and cathode? Pretty solid greek background for those words”

FARADAY: “Cheers Will, but the guys down at the pub weren’t very convinced by anode and cathode. They laughed at my poor use of greek”

[ FARADAY goes away and writes a paper using the terms DEXIODE and SKAIODE instead ]

WHEWELL: “Let me give you a quick crash course in greek, and you can tell your mates where to stick their criticisms”

FARADAY: “Y’know Will, I think you might just be right after all”

And that’s how the words “anode” and “cathode” first entered the language of electricity. That’s why we have “cathode ray tubes”. Not content with that, Faraday and Whewell went on to add the words “ion”, “dielectric” and also “diamagnetic” and “paramagetic” to the language, all terms which are still used today when describing electricity and magnetism.

The Future

I always liked Technetcast and now I’m listenting to it’s spiritual descendent, IT Conversations. It’s a collection of presentations and lectures from various conferences around the world, all in mp3 format. So I copy them onto my mp3 player, and I get to listen to Steve Wozniak telling his life story while I’m at the gym, or Stephen Wolfram talking about physics as I cycle somewhere.

This is the kind of thing that The Future always promised when I was younger. I can have video chats with my family over the internet. I can listen to audio lectures anywhere I want on a tiny portable device. I can check my email using GPRS halfway down a mountain biking trail. I can watch a degree level course on electromagnetism from MIT whenever I want, from the comfort of my own flat.

Okay, so we don’t have teleporters yet. And hard AI didn’t do so well. But, all things considered, the Future is doing pretty well.

Extreme Extreme

A while ago, I had the idea of always giving your customers a specially instrumented version of your product. The application would know what lines of code had been executed during your testing phase, and whenever the user ran a bit of code which hadn’t been tested pre-release, a dialog box would popup saying “Hey, you’re running code which we didn’t bother to test!”. How would you feel about doing that for your application?

Anyway, I stumbled across Guantanamo today which takes that one step further. It runs your unit tests, find any lines of code which haven’t been executed during the tests *and deletes them* from your source code. If they’re not being test, they’re not going allowed to exist. Heh, how cool is that?