Yet another Compiler Compiler – Andrew Birkett's blog

If you are ever foolish enough to wake one day and think “I’m going to write a compiler for the FooBar language”, you will soon find yourself well acquainted with the “FooBar Language Specification” document. In theory, this document tells you exactly what a program can look like, and how all the bits work. When you start writing the compiler, you’ll spend a lot of time checking little details in this document.

Boring, huh? You have the Holy Grail of Software Engineering (a complete and accurate specification!) and yet you’ll still need to read it, chew over it for a while, and then spend a good few months whacking keys like a little code monkey. If you watch someone write a compiler, they’ll read a bit of the specification, make some changes to the compiler, then repeat ad nauseam. What thought processes are going on in their head to convert this specification document into a compiler? Can’t we get the computer to do all the hard work, so we can spend our time doing more fun things?

Let’s look at our raw materials. The definition of a language is quite a hard thing to specify. It has to be very precise, so that compiler writers have a clear understanding of how the language works. Compiler writers have a hard job, especially dealing with all the corner cases and edge conditions. One of these days, someone using their compiler is going to try to write a program which uses a hellish mixture of overloading, inheritance, template specialization and inline assembly – and they’ll (quite reasonably) expect that the compiler is going to handle it and keep on smiling.

A precise specification is not necessarily a good specification though. Just think of the instructions you get with flat-pack furniture. They do tell you exactly what to do – “Insert bolt (1) through part (2), using nut (3)” – but they don’t convey the spirit of the procedure. It reminds me of the discussion in Zen and the Art of Motorcycle Maintenance where the author describes an analytical description of a motorbike (“a motorbike consists of a power assembly and a running assembly. The power assembly consists of .. yada, yada”). Such a description manages to tell you what a motorbike is, without ever conveying what a motorbike is. Hmm, just go read the book – it had a big effect on my life.

Uhh, let’s get back on course. Languages like “C” were created for pragmatic reasons – to get the job done (writing Unix). Later, when the language got more popular, it became important to create a formal specification, so that you could guarantee that each “C compiler” on the market would behave in the same way. So, while there is a language specification document for C, it is somewhat of a retrofit. Other languages, such as ML, have progressed along with their specification from early days.

If you read through the specification document for Java or C or C++ you’ll quickly notice that it’s written in English. Certainly, it’s a technobabble version of what you or I speak, more reminiscent of the ZATAOMM quote above than anything you’ll hear in the pub. But it’s English nonetheless. And, as anyone with a long-term partner will know, English is an ambiguous language. This is just asking for trouble. Compiler writers lose sleep over phrases like “should not” and “must not”. Even worse, when you try to use English in a very precise way, you risk making it a complete impediment to understanding.

Besides, our original plan was to get a computer to write a compiler for us. Computers aren’t very good at understanding English, so we’re unlikely to have much success with languages where the specification is written in English.

Fortunately, some clever language designers have thought to use something better to describe their language. For example, the creators of Standard ML use a formal notation called “natural semantics” to describe the meaning of ML programs. Cunningly, you have to buy their book to see it (or live near to a Uni library), but this notation enables them to elegantly and concisely describe the whole of the language in a single slim booklet.

This particular notation (“natural semantics”) is really just another language – one suited to describing the behaviour of programs. We rarely try to use English to discuss calculus or algebra (unless there’s no blackboard nearby) because mathematical notation is much more precise and concise. It’s the same situation with discussing programming languages. Sure, you have to invest some time in learning the language, but once you’ve done that you can communicate effectively and precisely about the meaning of programming languages. Once you’ve learned to read natural semantics, you can spend your winter evenings reading your way through “natural semantics for Java”, “natural semantics for C++”, and so on …

Natural semantics isn’t the only such “language” for describing the meaning of programming languages. There’s also operational semantics, denotational semantics, action semantics and probably lots more. So you could have “denotational semantics for Java” and “operational semantics for Java” and they’d both tell you what Java is.

Given that you can describe the behaviour of Java using any of these notations, you might wonder why you’d pick one over the other. The difference is that one notation is particularly good if you’re building a compiler, while another notation might be particularly good if you’re trying to prove properties of a program (like, “does it do what I hope it does?!”).

A “denotational semantics for C” describes the behaviour of C by mapping each part of the C language onto a mathematical object called a domain. Don’t worry about the details – just note that once you’ve got such a mapping set up, you’ve got a huge toolbox of mathematical techniques available to probe your language with. For example, you probably have an intuitive notion that the code “i++; i++;” is pretty much identical to “i += 2;”. Denotational semantics is the ideal tool for putting these intuitive notions on a more formal footing. Unfortunately, having all these mathematical objects and theorems floating around isn’t getting you much closer to having a compiler for the language.
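To make the “i++; i++;” example a little more concrete, here is a minimal sketch of the idea, using Python functions as stand-ins for the mathematical objects. Everything in it is an assumption made for illustration: the names (`incr`, `add_assign`, `seq`) are hypothetical, the state is just a dictionary of variables, and integer overflow is ignored. Each statement is mapped to a function from state to state, and “statement; statement” becomes plain function composition.

```python
from typing import Callable, Dict, Iterable

State = Dict[str, int]
Denotation = Callable[[State], State]


def incr(var: str) -> Denotation:
    """Meaning of the statement `var++;` as a function from state to state."""
    return lambda s: {**s, var: s[var] + 1}


def add_assign(var: str, n: int) -> Denotation:
    """Meaning of the statement `var += n;`."""
    return lambda s: {**s, var: s[var] + n}


def seq(first: Denotation, second: Denotation) -> Denotation:
    """Meaning of `first; second`: run first, then feed its result to second."""
    return lambda s: second(first(s))


twice_incr = seq(incr("i"), incr("i"))  # denotes "i++; i++;"
plus_two = add_assign("i", 2)           # denotes "i += 2;"


def agree_on(samples: Iterable[int]) -> bool:
    """Check that both denotations give the same state for each starting i."""
    return all(twice_incr({"i": v}) == plus_two({"i": v}) for v in samples)
```

Checking a handful of starting values, as `agree_on(range(-5, 6))` does, is only a spot check. The point of a real denotational semantics is that you can prove the two functions are equal for every state, rather than testing a few.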

In contrast, an “operational semantics for C” would describe the behaviour of C programs on some sort of hypothetical computer – probably a fairly simple one. For each construct in the C language, it would describe the transition from the initial state of the machine to the resulting state of the machine. This is a pretty reasonable way of defining a language, especially given that these are programming languages and mostly we’re interested in writing compilers for them. It’s important to choose the “hypothetical computer” carefully. You probably don’t want to choose a Real World computer, since that would make it hard to build compilers for other platforms. But you also don’t want to make the hypothetical computer too abstract (like a Turing machine, or lambda calculus) because your description wouldn’t convey the spirit of the language very effectively.
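As a rough illustration of “for each construct, describe the transition from the initial state to the resulting state”, here is a small sketch for a made-up toy language (not C). The assumptions are all mine: the hypothetical computer’s entire state is a dictionary of variables, programs are nested tuples, and the function names `eval_expr` and `run` are invented for this example.

```python
def eval_expr(expr, store):
    """Evaluate an expression against the current machine state."""
    kind = expr[0]
    if kind == "num":
        return expr[1]
    if kind == "var":
        return store[expr[1]]
    if kind == "add":
        return eval_expr(expr[1], store) + eval_expr(expr[2], store)
    if kind == "lt":
        return int(eval_expr(expr[1], store) < eval_expr(expr[2], store))
    raise ValueError(f"unknown expression: {kind}")


def run(stmt, store):
    """One rule per construct: take the initial state, return the resulting state."""
    kind = stmt[0]
    if kind == "assign":
        _, name, expr = stmt
        return {**store, name: eval_expr(expr, store)}
    if kind == "seq":
        _, first, second = stmt
        return run(second, run(first, store))
    if kind == "while":
        _, cond, body = stmt
        while eval_expr(cond, store):
            store = run(body, store)
        return store
    raise ValueError(f"unknown statement: {kind}")


# Roughly "while (i < 3) i = i + 1;" in the toy language.
count_to_three = (
    "while",
    ("lt", ("var", "i"), ("num", 3)),
    ("assign", "i", ("add", ("var", "i"), ("num", 1))),
)
final_state = run(count_to_three, {"i": 0})
```

A published operational semantics would write these rules in mathematical notation rather than Python, but the shape is the same: one rule per language construct, each describing how the state changes.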

If you were writing a compiler for a language, you’d probably find the “operational semantics” quite helpful. Basically, you’d just need to decide how to efficiently implement the operations of the “hypothetical computer” on your target machine, and the rest is easy!

But wait! We don’t want to write the compiler by hand. Let’s code up that knowledge (the mapping from “hypothetical computer” instructions to “target computer” instructions) into a program, and have it slurp in the “operational semantics” definition of the Java language (which is a mapping from “Java source code” to “hypothetical computer” instructions). Hey presto, we’ve got a Java compiler/interpreter!
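Here is a toy sketch of that plan, under some loud assumptions. The “language definition” is a table for a made-up arithmetic language (nowhere near Java), the hypothetical computer is a tiny stack machine with invented `PUSH`, `ADD` and `MUL` instructions, and the “target computer” is simply Python source text. The names `ARITH_SEMANTICS`, `PYTHON_BACKEND` and `make_compiler` are hypothetical too.

```python
# Language definition: source construct -> hypothetical stack-machine instructions.
# Each rule receives `emit` so it can translate its sub-terms.
ARITH_SEMANTICS = {
    "num": lambda emit, n: [("PUSH", n)],
    "add": lambda emit, a, b: emit(a) + emit(b) + [("ADD",)],
    "mul": lambda emit, a, b: emit(a) + emit(b) + [("MUL",)],
}

# Backend knowledge: hypothetical instruction -> lines of target code.
PYTHON_BACKEND = {
    "PUSH": lambda n: [f"stack.append({n!r})"],
    "ADD": lambda: ["b = stack.pop()", "a = stack.pop()", "stack.append(a + b)"],
    "MUL": lambda: ["b = stack.pop()", "a = stack.pop()", "stack.append(a * b)"],
}


def make_compiler(semantics, backend):
    """The 'compiler compiler': glue a language definition onto a backend."""

    def to_abstract(term):
        kind, *args = term
        return semantics[kind](to_abstract, *args)

    def compile_term(term):
        lines = ["stack = []"]
        for op, *operands in to_abstract(term):
            lines.extend(backend[op](*operands))
        lines.append("result = stack.pop()")
        return "\n".join(lines)

    return compile_term


compile_arith = make_compiler(ARITH_SEMANTICS, PYTHON_BACKEND)


def run_compiled(source):
    """Execute the generated target code and return the value it computed."""
    env = {}
    exec(source, env)
    return env["result"]


# 1 + (2 * 3) in the toy language.
program = ("add", ("num", 1), ("mul", ("num", 2), ("num", 3)))
answer = run_compiled(compile_arith(program))
```

Notice that `make_compiler` knows nothing about arithmetic. Handing it a different semantics table would give a compiler for a different language while reusing the same backend, and handing it a different backend would retarget the same language.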

What’s more, we can take that same program (a “compiler compiler”) and feed it the “operational semantics for C++”. Hey presto, instant C++ compiler!

This sounds great. I’m describing a world where language designers write a formal (accurate and complete) description of their language, and we can instantly build a compiler for it. Surely it’s too good to be true? Yeah, of course it is. This kind of thing does actually work to a certain extent (shock, horror) but it’s got big difficulties. For a start, a naive translation will result in a hideously inefficient compiler – it’ll generate correct code, but that code will run very slowly. And, regarding correctness, we’ve moved the goalposts – our compiler compiler had now better be bug free, or we’ll have big problems. That’s just the tip of the iceberg. There’s a lot more work required in this field before you can throw away your copy of gcc.

So, that’s the end of this minor epic. To summarize: we can describe the “meaning” of programming languages using a variety of notations – English, operational semantics, denotational semantics etc. Each of these flavours of notation is suited to a particular task. Some make it almost possible to generate a compiler for the language directly from the specification, eliminating the costly and bug-ridden process of having humans write compilers.