DNA Computing

Backyard DNA computing? Yeah, baby. Doesn’t seem too hard.

First of all, what is DNA computing. Aren’t computers made from metal with CPUs and memory? Yes, they commonly are, but that doesn’t mean that all computers must take that form. After all, a computer is just a compute-r, something which computes answers to problems.

I remember my grandad, who worked in the police, telling me how they used to keep details of criminals on filing cards. On the edge of each card there were a number of punched holes, some punched right the way to the edge. Each punched hole corresponded to some fact about the person – whether they were a burglar or a mugger or things like that. So, if you wanted to pick out all the burglars, you’d take a metal skewer and slide it in through the appropriate hole and lift it up. All the cards whose hole had been punched right to the edge (non-burglars) would stay behind, whilst all the relevant ones would get lifted up. Now, while that a million miles away from how modern “computers” work, it still has some flavour of “computing” a solution to a problem.

Now let’s flip to DNA. It’s a long polymer (poly=many, mer=parts) made up from monomers called nucleotides. Nucleotides come in four flavours (called A, T, C and G) and have two neat properties. Firstly, they can join together into big long chains. Secondly, these chains can get zipped together if there are matching nucleotides on each side (A matches with T, G matches with C). It’s the fact that these nucleotides are fussy about who they zip up next to which makes them interesting as computing devices.

If we’re going to try to solve some “computing” problem using DNA, we’ll have to pick a problem which doesn’t require a keyboard or a monitor, since test-tubes don’t come with them attached. The first problem to be solved in this way was what’s called the Hamiltonian Path problem which, despite the formidable name, is really simple. Imagine a small group of islands, connected by bridges. You start on Starting Island, and your aim is to get to Finish Island by crossing bridges, visiting each island exactly once. Now, depending on how many bridges there are, there may be several ways to do this or there may be no way to do it. So, to play the Hamiltonian Path game, someone describes the islands and bridges to you, and you have to either say “here’s a solution” or “there are no solutions”. When there’s lots of islands, it gets really quite Hard.

Fortunately for us, a bright chap called Leonard Adleman figured out how we can cheat at this game by messing around with DNA in testtubes until we find the answer. He’s a pretty bright guy, famous for being the “A” in the RSA encryption system among other things.

The trick goes like this. For each island, we choose a different 20-mer strand of nucleotides (ie. literally “20 parts”; molecular chemists like to pretend they know greek). It doesn’t really matter too much whether you choose ATCGGCTACTGCATGCTGAA or GGGGAAAATTTTCCCCAAAA for a given island, as long as they are different. You do need quite a lot of each strand though – billions and billions of them. And where do you get them from? Well, you can just buy them. They’re called oligonucleotides (more greek, oligos means “few”). Phone them up, say “hey, can you knock me up some GGAACCTT’s and a few billion ATCCGAT’s please?” and they’ll send you out a little white lump.

Here comes the clever bit. We’re going to build strands of nucleotides which represent the bridges. Let’s assume there’s a bridge from Starting Island to Fire Island. Remember how A and T will stick together, and G and C will stick together? These are called complementary pairs. Our bridge-strand is going to be carefully chosen so that its first half will be complementary to the second half of Starting Island‘s nucleotide strand, and the second half of the bridge strand will be complementary to the first half of Fire Island‘s strand. That means it can act like sticky tape to join the two together. That’s pretty cool, and it’s the first hint that something a bit like “computation” is going on here. We’ll need strands which correspond to each bridge, and we’ll need quite a lot of copies of them – Adleman used about 30,000,000,000,000,

So we take our island-strands and our bridge-strands and we put them into a test-tube. We give it a bit of a shake and we’re done. Problem solved!

Ahem, well, the problem has been solved but we need to do a bit of work to reveal the answer. What has happened in the test-tube is that bridge strands have joined the island strand together into big chains, each corresponding to a possible route through the islands. And because we started with so many billion copies of each, and we gave the test-tube a pretty good shake, we can be reasonably sure we’ll have plenty of example of each possible route through the islands in our test tube. Unfortunately, most of these routes will not honour the “only visit each island once” rule. However, it’s not hard to get rid of a lot of the wrong answers by selecting only those DNA strands with the right length. If we had 14 islands (like Adleman did) then we can pick out the DNA strands which are 14*20 = 280 nucleotides long. For this, we can use gel electrophoresis which amounts to blasting the DNA strands down through gel using an electric field to persuade them to move. (As an aside, I had an interesting conversation with someone I’d just met in the pub about why you can’t easily use mass spectroscopy for this).

We’re still not quite finished though. Just because our route visited 14 islands doesn’t mean we visited all 14 islands. Maybe we just spun around in a loop between Starting Island and Fire Island. And also, there’s no guarantee that the routes all started and finished in the right place. Again, biochemistry comes to our rescue. It’s not too difficult to pick out these DNA strands which start with Starting Island‘s sequence and end with Finish Island‘s sequence, thanks to a rather “special” guy called Kary Mullis. He won a Nobel Prize for inventing the PCR reaction, which is a way of selecting particular strands of DNA and making loads of copies of them. As with a lot of great inventions, it’s blindingly obvious with hindsight, but Kary was the first person to conceive of it – largely due to his background in both computer science and chemistry. People had been doing the polymerase reaction for ages (well, we humans had been doing it for millions of years, but we only noticed in the past few decades). The polymerase reaction builds up a (complementary) copy of an exiting DNA strand, so you end up effectively with twice as much. Mr Mullis’s stroke of genius was to do it again. And again. And again. And again, until you disappeared under a mountain of rice and you can’t find your chessboard any more. You should try and borrow a copy of his book, “Surfing Naked in the Mind Field” for a mind-altering view of the world. I think I left my copy in Miami somewhere, but I’m left with the memory of a very unique person. Oh, yeah. Special.

Once we’ve picked out all the strands which start and end in the right place, we go through the remaining pile firstly discarding those which never visited Fire Island, then those which missed out on the delights of Infinite Chocolate Island, and so on until we are left with a somewhat reduced pile of routes/strands. This selection process is done with some glass beads and some molecular glue. If you want to pick out strands containing the sequence AAAA, you can “glue” some TTTT strands (the complement) to some glass beads. Then you pour the solution containg your DNA over the beads, and the AAAA-containing strands will stick to them. Then you scrape them off with a spoon. Well, that’s close enough.

All of the remaining strands represent a route which visited all the islands exactly once. If you had microscopic fingers, you could pick up a strand of DNA and shout “Eureka, I finally have the answer!”. But, if your fingers are human-sized like mine, you will need a final bit of biochemistry to reveal the answer. There’s several different ways to read out the sequence of nucleotides which are left – Adleman used Graduated PCR. But at the end of the day (seven days in Adleman’s case) you end up with answer. Either there was a path through all the islands in which case you’ll have almost certainly found it (as in, really really almost certainly), or there wasn’t in which case you ended up with a sad empty test-tube.

So that’s it. That’s how you can perform a computation in a test-tube using DNA. There’s so much interesting stuff in this area. For a start, just because we can solve an island-hopping problem doesn’t neccesarily mean we can use DNA to solve every conceivable computing problem. It turns out, however, that DNA is that powerful. And, for certain types of problems, it may be a lot quicker than existing electronic computers. After all, you don’t need to buy a faster processor – you just use a bigger testtube! You can’t really use C++ to write DNA programs, so how do you encode an arbitary problem onto strands of DNA? It’s also interesting to look at how the imperfect nature of a biochemical reaction affects the usefulness of the results we get. Our electronic computers aren’t perfect theoretical beasts either – they get hit by alpha particles and have to endure power supply fluctuations. Finally, and this is where my interest really kicks in, can I coin the phrase “backyard dna computing” (cf. bymc) to describe what I’m going to try Real Soon Now -…

Wishlist

Electronic ink makes it into a consumer product – the Sony Librie. Only available in Japan atm, I think, but I’ve been waiting for one of these for a while. They’re selling it with a proprietry drm’d ebook reader on it, but it’s running linux and their source bundle includes what appears to be a framebuffer interface to the hardware. It prompted me to read about DRM technology, to see if there’s any good drm solutions out there in the research world, as opposed to the heavy-handed approaches popular by the big player. Anyhow, with the Librie, it’s the hardware which is tasty – 170dpi 800×600 reflective display, usable indoors and outside. Must find an importer …

Monads by stealth

Common coding idiom: A window sometimes has a tooltip associated with it. A file object may be currently open, and therefore has a current offset, or it may be closed in which case it doesn’t. In both of these cases, maybe we have the extra data or maybe we don’t – it depends on the object’s state.

Mostly in C++, people use pointers for this. The window object contains a pointer to the tooltip object which is valid when the tooltip is relevant, and null the rest of the time. To me, this is a mediocre solution.

Before I go any further, I’d better explain that one of my strong beliefs is that you should be able to get a really good idea of how a class works just by looking at the data members (ie. the state) of the class. You shouldn’t really need to read the implementation. I find this a pretty useful test of code quality and code complexity. Classes which have been prematurely optimised will show up as having more data/state than they Really Need To. Classes which don’t have a single well-defined responsibility will often stand out because they have extra “I’m currenlty in this state” flags. Well-written classes will have distilled their state down to the absolute essentials, and every data item should be able to justify their inclusion.

Furthermore, it’s always a good idea to convey as much of your intent in your source code as possible. If an object owns another object, it is much better to use an auto_ptr than a raw pointer, since that’s their raison d’etre. Spare a thought for the guys who have to read your code six months later, and be as precise as possible with your data types. If the object owns the other object from birth until death, use a const auto_ptr, since that’s all that a const auto_ptr will allow. (Of course, this only works if everyone understands the implications of the idiom. But describing idioms once to people is much easier than having to document the hundreds of places in the code where the idioms use would’ve made things cleaner).

Anyway, back to the “maybe has it, maybe doesn’t” story. Most people use pointers for this. I called this mediocre because using a pointer doesn’t convey the “might be valid, might not” intent. I’ve seen people using a convention of suffixing pointers with 00 to indicate “may be null” – as in “if (window->tooltip00) {..}”. That’s a useful convention, but it’s not something which a compiler can check.

Another common approach to this problem is to define a Maybe (aka Optional) datatype. It’s a simple thing – a Maybe value either currently holds an integer or it’s empty. You can call a method (eg. isPresent() ) to see if it currently holds proper data or not, and call a get() method to retrieve the data.

This has some advantages over using a pointer. For a start, it’s pretty clear this this “maybe data” may be present or may not be present, and so that’s a big win in terms of communication. Also, it doesn’t force you to switch to handling data by-reference like pointers do.

But it’s still not perfect because there’s no way for the compiler to verify that your program only calls get() on maybe-values which have real data. Sure, you can add a runtime check but it would be much nicer if we can verify this property of our program at compile time.

One thing which strikes me is that every place where processing of a Maybe<> value occurs we end up with an if-statement to check whether or not it’s valid. That’s a shame, since it’s basically code duplication all over the shop. Is there a way we could push this duplicate conditional code inside the Maybe<> class itself? Can we take the two bodies of the conditional and pass them to the Maybe<> class to process?

At this point, we’re kinda thinking upside down. Rather than thinking of Maybe as a passive container, we’re going to think of it as the active partner. It knows what it’s internal state is, but it won’t let you at it directly. Instead, the only thing you can do with a Maybe is pass it a couple of function pointers called “do this when you have data” and “do this when you’re empty”. It’s a bit of a shame having to turn the conditional bodies into fully-fledged functions, but that’s pretty much the best we can do in C++. So now the Maybe<> class gains a new method which, being unimaginative, we could call “doStuff” – it just checks whether it has real data, and runs the appropriate argument (ie. a function). Now we’ve managed to factor out all these conditionals.

(Well, it’s a pretty limited idea when we’re using C++ because creating functions is fairly work-intensive. You have to figure out what data they need, and pass it to them. If we used a better language (like, maybe, ocaml!) – which allowed you to treat functions as flexibly as values, creating them on the fly, automatically capturing their context, and passing them around as easily as if they were integers – then it’d be a much more powerful notion).

So, this has been a refactoring of the Maybe<> class from a simple passive data container into a more active participant (albeit a refactoring which can’t be expressed particularly elegantly in C++). It was driven by a desire to more fully express our intent in the source code, and to ensure that we always deal with the two cases (data, and no-data) every time we process a Maybe<> value. But what we’ve done is interesting. We went from a situation where the Maybe’s state was externally visible, to a situation where it was safely wrapped up inside the Maybe object and could only be acted upon by the “special tongs” of the two function we passed to the Maybe<> object.

A lot of work for a little advantage? Nah, now you’re basically playing with a monad, and you never even noticed.

Predicate objects

“Cecil has predicate objects, which allow virtual “subclasses” to be defined carrying specialized methods that override their parent’s methods whenever some boolean predicate over the object is true. In this way, inheritance can be used to model time-varying and/or state-dependent properties of objects just like permanent properties of objects” (from the Cecil homepage)

Now, this is kinda interesting because I was thinking about something a bit like this about a year ago. And furthermore, Cecil has a tasty looking feature list – multimethods, optional polymorphic static typesystem (if you don’t supply static type information, typechecking is done at runtime), closures, and a Self-like classless object system.

And for something completely different. Someone stole the front quick-release from my bike, outside work today. They didn’t take the whole wheel, which was left neatly in place. They just removed the quick release mechanism. How annoyingly strange. Time to arrange a webcam pointing at my bike ..