February 2005 – Andrew Birkett's blog

I write this blog so that I can look back in years to come and see what I spent time thinking about. Computers have this nasty habit of sucking up time without you noticing. Recently I’ve been doing relatively low-brow stuff. But, for my own benefit, I want to record it so that I don’t redo this again from scratch next year.

I’ve recently been considering hosting my own website. Until now, I’ve paid a hosting company to do it for me, but a few things made me reconsider this. Firstly, I’ve been hosting my own email for a while now, and it has been more reliable than any commercial email provider. I can set up more sophisticated spam filtering this way, and diagnose faults without having to spend hours discussing it with a support department. The server runs on a low power fan-less mini-ITX board in my hall cupboard, and the only downtime so far has been due to me tripping off the power to the entire flat once or twice (even then it auto-reboots … I could plug it into a trickle-charged 12v motorbike battery and have a UPS).

So, hosting my own website would give me much more flexibility. I get hit by an awful lot of blog spam (to the extent where I’ve switched off comments for now). Hosting locally would give me direct access to the database which underpins my blog, which would make it easier to tidy up things. Also, I’d like to have direct access to the webserver logs, which is something my current provider doesn’t give. I’ve got a reasonably fast internet connection to my home which is idle most of the day, and so it seems a bit daft to pay data-transfer costs to a commercial web-hosting company when I’m already paying for unused data-transfer to my home. Finally, I already have a “server” in my cupboard and it could easily take the (light) load of running my website too.

I looked into running the webserver on a user-mode linux machine. It’s effectively like a linux-specific VMware. There were two reasons for this. Firstly, running the webserver on its own machine increases security a bit. If someone used an exploit against the webserver and gained root, I certainly wouldn’t want them to then have access to my email or whatever other services are running on that machine. That’s why I have a separate server in the first place. UML makes it easy to have a separate machine for each service you wish to expose, effectively sandboxing them, without buying more hardware. Secondly, running as a UML instance makes backup really easy. UML is really easy to run. You have an executable called “linux-2.6.9” and a second file which is the image for the root disk. You run the executable, and you see a new copy of linux booting within your existing one, mounting the root disk image and leaving you at a login prompt. It doesn’t require you to tweak your existing kernel at all – brilliant. So, to back up that virtual machine you tell it to briefly pause (or shutdown), take a copy of the kernel file and root disk file, and you’re done. My root disk for a Debian 3.0 system running Apache, MySQL and PHP compressed down to about 90MB. I chose Debian because on a server, unlike on a developer machine where I choose Gentoo, I have no need for bleeding edge software or libraries.

Setting up Apache was easy, even though it’s been years since I last did this. Since I already needed a MySQL database for my blog, I added mod_log_sql to put all the access logs into a MySQL database. This was really overkill. I could see the module being very useful if you had a complicated multiple-VirtualHosts setup. But I was just doing it because I could .. and because I don’t really like Webalizer much. I like the idea of being able to phrase arbitrary queries and do some data-mining. Plus, it gave me a chance to refresh my SQL knowledge from University.
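As a rough illustration of the "arbitrary queries" idea, here is a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for MySQL, and the `access_log` table, its column names and the sample rows are all made up for the example; the real mod_log_sql schema may well differ.

```python
import sqlite3

# Stand-in for the MySQL log database. The table layout below is a
# hypothetical, simplified one, not the actual mod_log_sql schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE access_log ("
    "remote_host TEXT, request_uri TEXT, status INTEGER, bytes_sent INTEGER)"
)
sample_rows = [
    ("10.0.0.1", "/index.php", 200, 5120),
    ("10.0.0.2", "/index.php", 200, 5120),
    ("10.0.0.1", "/feed", 200, 2048),
    ("10.0.0.3", "/wp-comments-post.php", 403, 300),
    ("10.0.0.3", "/wp-comments-post.php", 403, 300),
]
conn.executemany("INSERT INTO access_log VALUES (?, ?, ?, ?)", sample_rows)


def hits_per_uri(connection):
    """Return (uri, hit_count, total_bytes) rows, most-requested first."""
    return connection.execute(
        "SELECT request_uri, COUNT(*) AS hits, SUM(bytes_sent) "
        "FROM access_log GROUP BY request_uri "
        "ORDER BY hits DESC, request_uri ASC"
    ).fetchall()


top_uris = hits_per_uri(conn)
for uri, hits, total_bytes in top_uris:
    print(uri, hits, total_bytes)
```

Once the logs are rows in a table, questions like "which URLs are being hammered by comment spammers?" become a one-line change to the WHERE or GROUP BY clause rather than a new log-parsing script.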

There’s something very cute about the way you back up MySQL databases. Most applications, such as word processors, persist their data by writing a snapshot of their current state to disk. MySQL writes out a sequence of commands which, when played back, will rebuild the database. So the start of the dump file will be a “CREATE TABLE …” followed by a series of “INSERT INTO …” lines. This is quite elegant. Why invent an entirely new serialization format when you already have a language which is expressive enough to do everything you need?
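The same dump-as-commands idea can be tried without a MySQL server. The sketch below uses Python's sqlite3 module, whose `iterdump()` method produces a similar SQL text dump; this is SQLite rather than MySQL, and the `posts` table and its contents are invented for the example.

```python
import sqlite3

# Build a small database to back up (hypothetical table and rows).
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT)")
source.executemany(
    "INSERT INTO posts (id, title) VALUES (?, ?)",
    [(1, "Hosting at home"), (2, "Free maps")],
)
source.commit()

# The "backup" is just SQL text: CREATE TABLE followed by INSERT statements.
dump_sql = "\n".join(source.iterdump())
print(dump_sql)

# Restoring means replaying those commands against an empty database.
restored_db = sqlite3.connect(":memory:")
restored_db.executescript(dump_sql)
restored = restored_db.execute("SELECT id, title FROM posts ORDER BY id").fetchall()
```

The restore step needs no special import tool: the dump is fed to the same interpreter that handles every other query.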

Although I don’t deal with databases in my day-job, it’s quite an interesting field in some ways. It’s well accepted that separating data-storage from the rest of your application logic is a wise plan. But SQL-backed applications have a further advantage that, say, an XML-backed application doesn’t have. By making such a clean separation in your application, you can leave the whole data-storage problem to someone else. There’s lots of really clever people who’ve figured out the best way to store and query big relational datasets – laying them out, and moving them between disk/main-memory/cache-memory in a pretty optimal way. As long as you can fit your data into the right shape, you can then magically take advantage of decades of cleverness. That’s a pretty impressive level of reuse.

On to the last part of the Linux/Apache/MySQL/PHP cluster: PHP. I spent some time looking through the source code for WordPress, my blog software. Blog software ought to be pretty simple. It’s just a glue layer which sucks data out of a database, munges it into HTML and sends it to a browser. But to my eyes, WordPress (and probably most PHP apps) is pretty dire. The code is pretty state-happy, with lots of imperative updating which wouldn’t be needed in a language with better building blocks. It’s a domain where people who think Perl is a fine language (and I mean that in a derogatory way) would be happy. But would I want these people to be writing secure e-commerce sites in this way?! I don’t want to think about that (because I know it’s true). I wasn’t impressed.

So, despite the fact that today I’m writing about setting up webservers, this brings me back to Philip Wadler’s Links project. The aim of this project is to take the Good Stuff from the world of research, and apply it to make a Better Way to produce web applications. Whenever I started working with XML, I thought “Great, we have schemas which define the structure of the data .. that means we can integrate that with our language’s static type system”. Hah, no such luck in the Real World … but projects like CDuce are showing the way. Similarly, if you write a web application you need to juggle with the inside-out-ness of the programming model – you can’t just call a function to get some user input, because your code is being driven at the top level by the HTTP request/response model and you always need to return up to the top level. Continuations offer a possible solution to this, as a richer means of controlling the flow of a program, as Avi Bryant’s Seaside framework demonstrates. Today, if you are writing a web application you need to worry constantly about what happens if the user hits the “back” button, or reloads a page, or clicks “submit” twice when they’re making a payment. Perhaps in the future, with better building blocks, these things will come “for free”, and we can wave a fond farewell to a whole class of common web-app bugs.
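To make the inside-out point concrete, here is a small sketch using a Python generator as a poor man's continuation. This is not how Seaside or Links work internally; the `checkout_flow` function and `Session` class are hypothetical names, and the "HTTP requests" are simulated by plain method calls.

```python
def checkout_flow():
    """A conversation written top to bottom.

    Each `yield` sends a page to the user and suspends until the next
    request arrives with the submitted form value.
    """
    name = yield "What is your name?"
    quantity = yield f"Hello {name}, how many items?"
    yield f"Thanks {name}, ordered {int(quantity)} item(s)."


class Session:
    """Holds one suspended conversation between (simulated) HTTP requests."""

    def __init__(self, flow):
        self._flow = flow()
        self.page = next(self._flow)  # run up to the first question

    def submit(self, form_value):
        """Resume the flow with the user's input and return the next page."""
        self.page = self._flow.send(form_value)
        return self.page


session = Session(checkout_flow)
first_page = session.page
second_page = session.submit("Ada")
final_page = session.submit("3")
print(first_page, second_page, final_page, sep="\n")
```

The flow reads like a function that simply asks for input, even though control returns to the top level after every page. A generator can only be resumed once from each point, though, so this sketch does nothing for the "back" button; handling that is where real, re-enterable continuations earn their keep.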

Web-based applications have lots of advantages (and disadvantages too). I personally really like the “your customers are always using the latest version of the software” aspect. But a lot of today’s web technologies are rooted too much in a perl-hacker mindset. It may be that this is indeed a rewarding place to apply newer programming technologies. I still think the world will not be ready for the Links project for many years to come, but perhaps it will pave the way.

Oh, back to the original story. Having installed everything and got it all working, I flipped my DNS record so that www.nobugs.org went to my home box. But the next morning, I flipped it back. Why? At the end of the day, paying someone about 30UKP a year to host my site is pretty good value. I don’t really want to be worrying about my website response time every time I’m downloading big files over the same link. And if my website ever gets cracked, I’d still rather it was on someone else’s LAN and not mine. Although it might seem like a waste of time to spend hours setting all this up and not use it, I know that I’ve learned lots of useful information and lessons. C’est la vie.

As mentioned before, I am interested in producing a free/non-copyrighted map of Edinburgh. There are several reasons for this, but the main motivation is ideological. Information about the city I live in ought to be free. It’s *our* city. Information about our city ought to be a public asset. The Ordnance Survey keeps a very tight hold on its data, and charges lots of money for it, despite being a department of our own government. This situation is unlikely to change, unless some crazy geeks bypass the whole establishment and produce their own (totally non-derived) map data. That’s the ideology.

The second reason is more pragmatic. If I want to find where someone lives, I can look at multimap. However, if I’m trying to write a route-finder computer program then I need the data about roads/junctions in a form which my program can process. Multimap doesn’t help me with this. So, a secondary benefit of producing a map myself is that I can annotate it with metadata (like, streetname, one-way status, steepness of hill) in a form that a computer can understand. Additionally, any other location-related computerized data sources (such as postcode regions, location of wifi hotspots, or pollution measurements) can be meshed together with the map data.
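A minimal sketch of what "a form which my program can process" might look like: road segments as tuples carrying a street name and a one-way flag, fed into a textbook Dijkstra shortest-path search. The junction labels, street names and lengths are all invented for illustration.

```python
import heapq

# Hypothetical road segments: (from, to, length in metres, street name, one_way)
SEGMENTS = [
    ("A", "B", 100, "High St", False),
    ("B", "C", 100, "Hill Rd", True),   # one-way from B to C only
    ("A", "C", 350, "Long Way", False),
    ("C", "D", 50, "Short Ln", False),
]


def build_graph(segments):
    """Turn annotated segments into an adjacency list, honouring one-way flags."""
    graph = {}
    for start, end, length, name, one_way in segments:
        graph.setdefault(start, []).append((end, length, name))
        graph.setdefault(end, [])
        if not one_way:
            graph[end].append((start, length, name))
    return graph


def shortest_route(graph, source, target):
    """Dijkstra's algorithm; returns (total_length, [junctions]) or None."""
    queue = [(0, source, [source])]
    visited = set()
    while queue:
        dist, node, path = heapq.heappop(queue)
        if node == target:
            return dist, path
        if node in visited:
            continue
        visited.add(node)
        for neighbour, length, _name in graph.get(node, []):
            if neighbour not in visited:
                heapq.heappush(queue, (dist + length, neighbour, path + [neighbour]))
    return None


graph = build_graph(SEGMENTS)
outbound = shortest_route(graph, "A", "D")
inbound = shortest_route(graph, "D", "A")
print(outbound, inbound)
```

The outbound and return routes differ because of the one-way street, which is exactly the kind of metadata a picture of a map cannot give a program.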

There are three methods that can be used to produce a map. The classical way is to perform laborious ground surveys. That’s soo yesterday! The more modern way is to use satellite imagery or aerial photographs as a starting point, and trace roads/buildings manually or with computer assistance. While high-res satellite imagery is available to the public, it’s expensive and so I discounted that option (for now). The third option, which I’m looking at just now is to use a handheld GPS system and gather trails as I walk/cycle/motorbike around the city.

I wasn’t sure how well GPS would work in the city. There are a number of GPS satellites orbiting around the earth, each broadcasting a time signal. If your handset can see enough of these satellites, it can figure out its longitude and latitude to some degree of accuracy. In the open, accuracy is typically to within 10m, but in a city you often don’t have a good view of the sky and accuracy suffers. An accuracy of 10m doesn’t sound great, but consider that most roads are probably 10m wide so it’s not too bad.

So, I borrowed my brother’s Etrex GPS system and carried it around as I travelled round the city. The GPS handset shows you a graphical view of where you’ve been, and this was enough to confirm that GPS probably did work well enough in the city.

Next step was to get the data onto my PC for processing. GPSbabel took care of downloading the data from the handset into GPX, which appears to be the preferred interchange format for GPS data. I then converted this into the shapefile format, which is a format for vector data commonly accepted by GIS systems. GIS systems are usually hulking great beasts of software, designed to slurp in terabytes of satellite imagery, vector roadmaps, elevation data and the like, and allow you to query it efficiently. However, many GIS systems are obscure and have a steep learning curve. After looking through lots of options, I settled on the JUMP project as being the most hopeful candidate. It happily imported my raw shapefile/GPS data, and I was able to generate a simple map layer from the data and annotate the roads with attributes like “streetname”.
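GPX is plain XML, so the trail can also be inspected with a few lines of code before it goes anywhere near a GIS. The sketch below uses only Python's standard library; the sample document and its coordinates are made up, and a real GPSbabel export will carry more elements (timestamps, metadata) than shown here.

```python
import xml.etree.ElementTree as ET

# Hand-written sample standing in for a real GPX export.
SAMPLE_GPX = """<?xml version="1.0"?>
<gpx version="1.0" xmlns="http://www.topografix.com/GPX/1/0">
  <trk><trkseg>
    <trkpt lat="55.9520" lon="-3.1960"><ele>70.0</ele></trkpt>
    <trkpt lat="55.9525" lon="-3.1950"><ele>71.5</ele></trkpt>
    <trkpt lat="55.9530" lon="-3.1940"/>
  </trkseg></trk>
</gpx>"""


def read_track_points(gpx_text):
    """Return a list of (latitude, longitude) pairs from a GPX document.

    Tags are matched by local name so that the GPX 1.0 and 1.1
    namespaces are both accepted.
    """
    root = ET.fromstring(gpx_text)
    points = []
    for element in root.iter():
        if element.tag.rsplit("}", 1)[-1] == "trkpt":
            points.append((float(element.get("lat")), float(element.get("lon"))))
    return points


points = read_track_points(SAMPLE_GPX)
print(points)
```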

And so …. *drumroll* … here are the beginnings of what will hopefully turn into my free Edinburgh streetmap.

Now, there are still some issues to be resolved here. The map data has been sheared at some point on its journey into the JUMP system. If you are familiar with Edinburgh, you’ll know that the roads which join Princes St should all join at right-angles, which isn’t the case in the above map. I imagine that there’s some disagreement about coordinate systems somewhere. The cause will doubtless be blindingly obvious after I’ve figured out what is going wrong, but this is all part of the learning curve.
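One candidate explanation, offered purely as a guess and not a diagnosis of what JUMP actually did: if raw longitude/latitude degrees are plotted as if they were equal-sized x/y units, the map is stretched east-west, because at Edinburgh's latitude a degree of longitude covers only about 56% of the ground distance of a degree of latitude. Streets that run diagonally then stop looking perpendicular. The sketch below assumes a spherical earth and a simple local equirectangular approximation, and the two "streets" are invented.

```python
import math

EARTH_RADIUS_M = 6371000.0      # mean radius, spherical-earth assumption
ORIGIN = (55.95, -3.20)         # roughly central Edinburgh (lat, lon)


def offset_to_degrees(east_m, north_m):
    """Convert a ground offset in metres to (dlat, dlon) in degrees."""
    dlat = math.degrees(north_m / EARTH_RADIUS_M)
    dlon = math.degrees(east_m / (EARTH_RADIUS_M * math.cos(math.radians(ORIGIN[0]))))
    return dlat, dlon


def to_local_metres(dlat, dlon):
    """Project a (dlat, dlon) offset back to local (x, y) metres."""
    x = math.radians(dlon) * EARTH_RADIUS_M * math.cos(math.radians(ORIGIN[0]))
    y = math.radians(dlat) * EARTH_RADIUS_M
    return x, y


def angle_between(v1, v2):
    """Angle in degrees between two 2D vectors."""
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    cosine = dot / (math.hypot(*v1) * math.hypot(*v2))
    return math.degrees(math.acos(max(-1.0, min(1.0, cosine))))


# Two hypothetical streets that meet at right angles on the ground,
# one heading north-east and one heading south-east.
street_a = offset_to_degrees(100.0, 100.0)
street_b = offset_to_degrees(100.0, -100.0)

# Naive plot: treat (lon, lat) degrees directly as (x, y).
raw_angle = angle_between((street_a[1], street_a[0]), (street_b[1], street_b[0]))

# Projected plot: convert to metres first.
projected_angle = angle_between(to_local_metres(*street_a), to_local_metres(*street_b))

print(round(raw_angle, 1), round(projected_angle, 1))
```

In this toy case the naive plot shows the junction at under 60 degrees while the projected one shows 90, so a missing or mismatched projection step is at least worth ruling out.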

So, this represents a pretty successful spike solution. I’ve done a pretty minimal amount of work to establish that the GPS method works, and that the software exists to allow me to make a pretty reasonable map. Now, I might actually start gathering data a bit more seriously, and see about organising a bit of infrastructure to allow other similarly minded people to contribute GPX trails of their own. I’ll also see about integrating SRTM elevation data (which was gathered on a space shuttle mission) to provide height data – although it’s only on a 100m grid, and the presence of tall buildings will cause problems.