Crashes of one sort or another – Andrew Birkett's blog

In a world where technical subjects are often made needlessly complex through the use of jargon, it’s always refreshing to come across examples of clear thinking and good communication. This time, it’s the serious subject of the report into the causes of the Concorde crash. After the initial horror of the crash itself, there was clearly a need to understand and respond to the causes of the crash (those who ignore history are doomed to repeat it). Apart from being a remarkable insight into the workings of Concorde, what is clear from reading through the report is that it is based around evidence rather than conjecture. Whereas a media report can simply state “it is thought likely that a piece of metal caused a tyre explosion” and be sufficient for its purpose of informing the masses, a formal crash report must stand up to higher levels of rigour. If there was a piece of metal on the runway, there must be a plane out there missing a piece of metal. You need to find that plane and carefully examine it. If you think a piece of metal could cause Concorde’s tyre to explode, you need to get an identical tyre and try running it over an identical bit of metal under realistic conditions to see what can happen. Simply saying “this could have happened” is not enough – you need to demonstrate that the theory is consistent with known events, and you need to carefully put together a jigsaw puzzle, justifying each and every piece.

Probably the most important ingredient for a retrospective investigation like this is the evidence trail, leading backwards into time like the roots of a tree. The design and manufacture of Concorde is well documented, as are the maintenance schedules and details. The operation of the airport on that day is well documented, and even the chance video footage of the accident itself provides information which can be matched up with telemetry and blackbox recordings.

Reading the report reinforced the feeling I had when I read the Ariane 5 report and the Challenger report: We can learn from our mistakes only if we preemptively leave enough crumbs of information along the way.

We’re human, and we screw up pretty regularly. If we accept that we’re going to make mistakes, we can at least plan for this eventuality. By making good choices today, we can minimize the impact and cost of our future mistakes and ensure that we can learn from them.

I don’t design aeroplanes for a living, so it’s time to connect this, as ever, back to the world of software. All our applications contain bugs of one sort or another. Sometimes they crash, sometimes they give wrong answers, sometimes they just lock up and stop responding. They always have done and they always will do. If we accept that this is true, what can we usefully do in response?

Let’s look at crashes first. Crashes are bad primarily because they usually imply some degree of data loss or state loss. Perhaps you lose a few pages of the book you were writing. Or perhaps you’d spent ages getting your “preferences” just right, and the app crashed without saving them. Or maybe the application crashes every time you try to print, and you really need to print!

Unlike with aeroplanes, it’s easy to restart the application after a crash. But it’s the data loss which is painful to users. So the First Law of Applications ought to be “never lose a user’s data, or through inaction allow a user to lose data”. It’s not acceptable to rely on the user to manually save their work every few minutes. It’s not acceptable to trust that the user will quickly do a “Save as..” after starting work on a new document instead of working away for hours in the default Untitled document. It should be the application’s responsibility to ensure that the user’s work is safe in the event of a crash. So, it should be regularly saving the user’s work, settings and undo/redo queue (and any other state) into a secret file somewhere. In the event of a crash, the application should take care of restarting itself and pick up where it left off.
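As a rough sketch of what that “secret file” could look like, here is one way to autosave state in Python. The function names `autosave` and `recover`, the JSON format and the file name are all assumptions made for illustration, not anything taken from a real application. The one idea that matters is writing to a temporary file and then renaming it over the old copy, so that a crash in the middle of a save can’t destroy the previous good copy.

```python
import json
import os
import tempfile


def autosave(state: dict, path: str) -> None:
    """Write state to a recovery file without ever leaving it half-written."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory first.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to disk before renaming
        # Rename over the old copy; this is atomic when both paths are on
        # the same filesystem, so readers see the old file or the new one.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def recover(path: str):
    """Return the last autosaved state, or None if there is nothing to recover."""
    try:
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    except FileNotFoundError:
        return None
```

An application would call `autosave` on a timer or after every significant edit, and call `recover` at startup to decide whether to offer the user their previous session.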

Continuing with the aviation analogy, a plane doesn’t just have two possible states – “everything fine” and “crashed”. Planes are complex beasts, with engines, hydraulic systems, electrical systems. It is almost certain that something will start to go wrong with some part of the plane eventually. So, the designers also build in fault-detection systems and maintenance schedules. There are systems for detecting unusually high temperatures, excessively low tyre pressures, problems with the electrical system and so on. Furthermore, there are systems for checking that those systems are working correctly (if your smoke alarm isn’t working, you don’t have a smoke alarm).

Most applications also contain fault-detection systems. An ASSERT() statement is a little self-check inside a program. It states some property which should always be true (e.g. a person’s weight must be a positive value). If the assertion fails, something is badly wrong. A good application will have assertions sprinkled liberally all over the codebase – a little army of sanity-checks, watching out for something going wrong.

However, for some crazy reason, most people only enable these self-checks during in-house development. When they ship the application, they disable these checks. This might be reasonable behaviour if the software industry had a track record of detecting every single bug before shipping, but that is obviously not the case. Even the best tested application in the whole world will throw up bugs once it’s out in the field, perhaps running on an old Win98 machine, set to Icelandic language, with a slightly unusual mix of recent and out-of-date system DLLs, and a copy of some overactive anti-virus software.

If there is a bug, and you have assertions enabled, you are quite likely to detect the problem early and be able to gather useful information about the application’s current state. With assertions disabled, the same bug will probably allow the application to stumble on for a few more seconds, gradually getting more and more out of control until it crashes in a tangled heap, far away from the original problem area.

Aeroplane designers don’t take out their fault-detection systems once the plane enters service. Neither should application developers. The performance hit is negligible compared with the long-term reliability gains.
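To make this concrete, here is a small sketch in Python, where the built-in `assert` statement is skipped when the interpreter runs with the `-O` flag, much like ASSERT() being compiled out of a release build. The helper `check`, the `InvariantError` class and the `body_mass_index` example are hypothetical names invented for this illustration; the point is only that the self-check is an ordinary function call, so it stays switched on in the field.

```python
class InvariantError(Exception):
    """Raised when a self-check fails; carries context for the bug report."""


def check(condition: bool, message: str, **context) -> None:
    """Like assert, but an ordinary call, so `python -O` does not remove it."""
    if not condition:
        details = ", ".join(f"{k}={v!r}" for k, v in sorted(context.items()))
        raise InvariantError(f"{message} ({details})" if details else message)


def body_mass_index(weight_kg: float, height_m: float) -> float:
    # A person's weight and height must always be positive values.
    check(weight_kg > 0, "weight must be positive", weight_kg=weight_kg)
    check(height_m > 0, "height must be positive", height_m=height_m)
    return weight_kg / (height_m ** 2)
```

A failed check raises at the point where the bad value first appears, with the offending value in the message, rather than letting the program stumble on and fail somewhere far away.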

Most warning systems in an aeroplane exist because some remedial action can be taken. There’s no “wing has fallen off” warning, because there really isn’t anything you can do to recover from that. What remedial action could a software system take, instead of just stopping when a fault is detected? Actually, if crashing doesn’t lose any user data, then crashing is not too serious. Restarting the app probably gets us back to a known good state. One idea, called microreboots (part of crash-only software), is to reduce the granularity of the “restart”. For example, if a crash occurs in the GUI part of the app, it may be possible to restart only that section of the program, leaving the underlying data unaffected. Or if a crash occurs on printing, it would be acceptable to abort the printing operation and restart only that chunk of the app.

This is easier in some languages than others. In C++, the unregulated access to memory means that the “printing” part of the application can easily scribble over the memory belonging to another subpart of the application. In strongly typed languages (whether dynamically or statically typed) this is not a problem. Furthermore, a NULL pointer access in C++ (which is a very common cause of crashes) is not recoverable. It does not get turned into an exception, like in Java, and cannot really be dealt with usefully by the application. Given these constraints, it seems to me that a C++ program can only use microreboots by splitting the application across multiple processes and using “processes” as the reboot granularity. Furthermore, the application needs to deal gracefully with the fact that subparts may fail and not carry out a request.

Even if you don’t go all the way towards microreboots, you can still use the ideas to flush out bugs from your application. In the classic model/view design pattern, all of the state resides in the model. So, you should be able to pull down the whole view system and restart it whenever you choose. You could, quite possibly, do this every minute on a timer. Any bugs caused by the view irresponsibly caching state may well be flushed out by this technique. If you can’t do this in your application, why not?
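A toy sketch of the idea in Python follows. The `Model`, `View` and `App` classes are made up for illustration and are far simpler than any real GUI, but they show the property being tested: if the view really holds no state of its own, throwing it away and building a new one changes nothing that the user can see.

```python
class Model:
    """Holds all of the application state."""

    def __init__(self):
        self.lines = []


class View:
    """Holds nothing except a reference to the model it displays."""

    def __init__(self, model):
        self.model = model

    def render(self):
        # Always read from the model; never cache a private copy.
        return "\n".join(self.model.lines)


class App:
    def __init__(self):
        self.model = Model()
        self.view = View(self.model)

    def restart_view(self):
        """Discard the whole view and build a fresh one from the model."""
        self.view = View(self.model)
```

Calling `restart_view` from a periodic timer during testing would expose any view that had quietly started keeping its own copy of the data, because its output would change after the restart.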

Let’s revisit the aviation analogy one more time. Recommendation R7 in the Ariane 5 crash report says “Provide more data to the telemetry upon failure of any component, so that recovering equipment will be less essential”. That’s a posh way of saying it’s easier to analyze the blackbox recordings than it is to figure out what happened by piecing together debris.

When an application crashes, the first thing the developers want to know is “what were you doing when it crashed?”. If you can reliably remember what you were doing in the lead-up to a crash, you’ve got a rare talent. Most people can give a general description of what they were trying to achieve at the time, but only rarely do people remember the exact sequence of events leading up to the crash.

So why rely on people’s memory? The solution is to have the application gather the information itself. When an app dies, it actually has plenty of time to collect together a crashdump (or minidump) which details exactly what the program was doing at the point it crashed – what code was running, what the contents of the memory were. It can also explain to the user what has happened, invite them to email the crash report back to the developers, restart the application and resume from where it left off. If a developer receives a crashdump they can usually see immediately what the application was doing at the time it crashed.

Furthermore, the application can have a “blackbox recorder” which keeps track of recent activity. It can record a high-level summary of activity, like “12:10 Opened file “foo.txt”, 12:11 Edit/Select All menu item chosen” etc. If the application subsequently crashes, it can add this summary of recent activity into the crash report. This more-or-less removes the need for users to explain what they were doing.
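Here is a minimal sketch of such a recorder in Python. The `BlackBox` class and the `crash_report` function are hypothetical names, and a real crash reporter would collect much more than this. The recorder keeps only the most recent events in a fixed-size buffer, so it costs almost nothing to leave running all the time, and the report glues that recent activity onto the traceback of the failure.

```python
import time
import traceback
from collections import deque


class BlackBox:
    """Remembers only the most recent `capacity` events."""

    def __init__(self, capacity=100, clock=time.time):
        # A deque with maxlen silently drops the oldest entry when full.
        self._events = deque(maxlen=capacity)
        self._clock = clock

    def record(self, message):
        self._events.append((self._clock(), message))

    def dump(self):
        return "\n".join(
            f"{time.strftime('%H:%M:%S', time.localtime(ts))} {msg}"
            for ts, msg in self._events
        )


def crash_report(exc, blackbox):
    """Combine recent activity with the traceback of the failure."""
    trace = "".join(
        traceback.format_exception(type(exc), exc, exc.__traceback__)
    )
    return "Recent activity:\n" + blackbox.dump() + "\n\nTraceback:\n" + trace
```

The application would call `record` from its menu handlers and file operations, and call `crash_report` from whatever top-level handler catches the fatal error.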

I like to think of this as proactive debugging. Normally, debugging occurs after the fact – something has gone wrong and you need to figure out why. If you adopt the “proactive debugging” mindset, you think “at some point, I’m going to have to track down a bug in this app, so what kind of scaffolding and information would make that task easier?”. And then you add that in ahead of time. As the project develops, and you learn more about what kinds of bugs are common, you can tune your monitoring systems to pick up these problems as early as possible.

I don’t think there’s much chance that developers will stop writing buggy code any time soon, so we may as well concentrate effort on building a better net to catch the bugs once they’re there.

A small note about practical matters of logging.

All that data is great, but it has to be managed correctly if you need it to be
switched on in the field.

We have a third party database application that has a log level of 0-255.
Anything below 255 is worthless, and 255 provides reams. However, I am loath to
turn it on for the majority of users, as it produces a log file of ~500 MB for a
morning’s work. That is not too bad in itself, but there is something of
Shlemiel the painter going on
(http://www.joelonsoftware.com/printerFriendly/articles/fog0000000319.html). By
mid-morning the user’s interface has slowed right down, to the point of
uselessness, solely due to logging.

Additionally, some users aren’t comfortable finding files on their computer
to be transferred across the network, or compressing and e-mailing them. You
can’t tell a user at a remote site “Stop work, I’ll come and collect the log
file, and by the way don’t restart the application as it will wipe it. I should
only be about an hour.”.

The end result of this is that we don’t have the logs switched on by default, and
thus they’re not available when required.

In short, you have to manage your data collection, retention, and transport
processes at every level. Dumping all your information to a text file for later
collection does not constitute a “black box”.
