Decentralized monitoring

Occasionally, one of the three computers in my flat will crash or hang in some way. When this happens, it’d be nice to be notified. For example, if the MythTV host crashes it’ll stop recording any TV programs until I notice and kick it. So I could just install one of the many network monitoring packages out there. But as far as I can see, they all want me to choose one ‘master’ host and have the other machines report their statistics into the master. That sounds awfully oldskool. Which of my three machines should I pick as ‘master’? They’re all equally likely to go down, and when the master goes down, I won’t get any notifications.

What I really want is a “neighbourhood watch” scheme, where each host watches the others. Or, in posher language, I want a decentralized monitoring solution. Here’s how it’d work. Each host would announce ‘I’m still running’ or ‘My load is 50%’ every few seconds, and the other hosts would listen and record each message to disk. If any one machine bursts into flames, the other two should still have a complete record (including the fact that the failed host stopped talking at 15:30).
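To make the record-keeping side concrete, here is a minimal sketch of what each listening host might do with the announcements it hears. Everything in it is my own assumption rather than an existing tool: the `HeartbeatLog` name, the 10-second timeout, and the choice to keep the history in memory (a real version would append to a file on disk, and would need some network code to receive the messages in the first place).

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatLog:
    """What one host remembers about the announcements it has heard."""

    timeout: float = 10.0  # assumed: seconds of silence before we worry
    last_seen: dict = field(default_factory=dict)  # host -> time of last message
    history: list = field(default_factory=list)    # (time, host, message) tuples

    def record(self, host, message, now):
        """Store one announcement, e.g. 'still running' or 'load 50%'."""
        self.last_seen[host] = now
        self.history.append((now, host, message))

    def silent_hosts(self, now):
        """Hosts we have heard from before, but not within the timeout."""
        return sorted(
            host for host, seen in self.last_seen.items()
            if now - seen > self.timeout
        )


log = HeartbeatLog(timeout=10.0)
log.record("mythtv", "still running", now=0.0)
log.record("desktop", "load 50%", now=0.0)
log.record("desktop", "load 40%", now=12.0)
quiet = log.silent_hosts(now=15.0)  # mythtv last spoke at t=0, so it is overdue
```

Because `last_seen` keeps the time of the final message, a surviving host can say not just that a neighbour is down but when it stopped talking.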

Ideally, hosts should be able to add themselves to the party without much hassle. Announcing the messages via IP broadcast would fit the bill, but it’d only work on one LAN segment. Also, the hosts might end up being fairly busy if they have to handle status updates from all other hosts all the time.

Sounds like an ideal application of gossip protocols. When a node first joins the monitoring party, it needs to be told about at least one other participating node. The nodes can subsequently exchange occasional ‘gossip’ messages with a few of their neighbours, telling them about the state of the world. The gossip message can also contain information about other nearby participants. In this way, information gradually spreads out around the network. Network partitions are a minor pain, but so long as there’s at least one other host still around to notice any explosions, they’ll still be noted.
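Here is a rough sketch of the exchange step, under some assumptions of my own: each node keeps a table mapping host name to a `(version, status)` pair, a higher version means fresher news, and a gossip exchange is a two-way merge that keeps the freshest entry for each host. The names `merge_views` and `gossip_round` are invented for illustration, and the whole 'network' is simulated in one dictionary rather than with real sockets.

```python
import random


def merge_views(mine, theirs):
    """Combine two tables, keeping the highest-versioned entry per host."""
    merged = dict(mine)
    for host, (version, status) in theirs.items():
        if host not in merged or version > merged[host][0]:
            merged[host] = (version, status)
    return merged


def gossip_round(views, fanout, rng):
    """Each node swaps tables with up to `fanout` randomly chosen known peers."""
    for node in list(views):
        peers = [p for p in views[node] if p != node and p in views]
        for peer in rng.sample(peers, min(fanout, len(peers))):
            merged = merge_views(views[node], views[peer])
            views[node] = merged
            views[peer] = dict(merged)


# Three hosts; 'a' and 'c' each start out knowing only about 'b'.
views = {
    "a": {"a": (1, "up"), "b": (0, "up")},
    "b": {"b": (1, "up"), "c": (0, "up")},
    "c": {"c": (1, "up"), "b": (0, "up")},
}
rng = random.Random(0)
for _ in range(3):
    gossip_round(views, fanout=1, rng=rng)
```

After a few rounds every node has heard of every other node, even though 'a' and 'c' were never introduced to each other directly.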

This seems to fit the bill nicely: decentralized, unlikely to lose data, and robust to hosts appearing and disappearing.

So that’s the monitoring and data recording side dealt with (ha, no lines of code written!). But what about the alerting? If one host goes down, I don’t really want N other hosts all to start spamming me with duplicate notifications. I guess neighbourhood watch schemes suffer from this problem: everyone in the street hears a window smash and the police get twenty phone calls! But can a decentralized mob temporarily appoint one of their number as a leader in order to send out a single notification? Yes, that’s exactly what distributed ‘election’ algorithms are designed to do. If all the hosts can talk to each other, they’ll manage to agree on a single leader. If there’s been a network partition, at worst we’ll get one leader in each partition and so get a few notifications (i.e. the right hand says ‘I can’t see the left hand!’ whilst the left hand says ‘I can’t see the right hand!’).
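As a toy illustration of the single-notification idea, here is about the simplest rule I can think of: every host independently picks the alphabetically lowest host name it can currently reach (itself included) as the leader, and only the leader sends the alert. This is a stand-in I have chosen for brevity, not a proper election protocol; real election algorithms also have to cope with messages in flight and with hosts that disagree about who is reachable. The function names and host names are made up.

```python
def elect_leader(reachable_hosts):
    """Deterministic rule: the lowest host name wins."""
    return min(reachable_hosts)


def should_alert(me, reachable_others):
    """True if this host is the leader among the hosts it can still see."""
    return elect_leader(set(reachable_others) | {me}) == me


# 'mythtv' has died; 'desktop' and 'laptop' can still see each other.
desktop_alerts = should_alert("desktop", {"laptop"})  # True: 'desktop' sorts first
laptop_alerts = should_alert("laptop", {"desktop"})   # False: defers to 'desktop'

# Network partition: 'laptop' can see nobody, so it leads its own partition.
lonely_laptop_alerts = should_alert("laptop", set())  # True
```

If every host agrees on who is reachable, exactly one of them alerts. In a partition, each side elects its own leader, which gives the 'one notification per partition' behaviour described above.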

There’s maybe a bit of tension here. You could write a gossip protocol where a node only keeps track of a few neighbours at any given time. But, for an election, you’ll need to know about everyone in the network. And even in a gossip system where nodes do eventually learn about the whole network, there’s going to be a startup period where the new node doesn’t yet know about everyone. Hmm, but I guess the election process might be able to handle this if it can gather together all the partial knowledge about who’s in the election. Ah, more detailed reading required!

Surely this system already exists? Well, the closest I could find was GEMS, which doesn’t appear to handle the single-notification problem. All very interesting stuff though!