Compilers galore

Just released, gcc 3.4 and a free download of the MS C++ compiler. Go figure.

I’ve just got one of these (the thing on the right, not the coke can) which is now quietly humming away running my mail server. It is currently sitting on a few bits of wood in my cupboard (the best kind way to silence a PC – put it somwhere else and knock holes through walls for the cables) but maybe I should do something like this with it?

I’ve twice recently had cause to refer to this article on “upside-down inheritance”, which discusses a very flexible method of composing functionality together, without some of the problems of a purely multiple-inheritance based approach.

(Oh, I forgot to add: this weblog is now syndicated over at LiveJournal)

Pretty Printing Parsing Problems

Ah, I’ve returned to the Real World after a week of snowboarding fun. Culture shock, muchly.

I haven’t written much recently. That was’t due a lack of stuff to write about, but more due to me doing a little bit of this, then a little bit of that, and never really doing anything in large enough chunks to write cohesively about. Still, I want to play catchup so that in a years time I can remember where I spend my efforts.

My rough, sprawling plan is to finally write my own dream development environment. I’ve pondered lots about what this would look like, but what liberated me was deciding to wave a metaphorical middle finger at the rest of the world. This is going to be *my* dream development environment. I’ll make the tradeoffs which are sensible for me, and I don’t care if anyone other than me finds the end result useful. I’m waving goodbye to ascii source code. CVS, diff, grep and all the other text-based tools won’t be relevant any more. I’ve nothing against these fine tools, but the Big Number One tradeoff I’m making is to base the whole system around an abstract representation of source code rather than a concrete ascii representation. This isn’t a straightforward tradeoff to make. There are as many downsides as there are upsides, but after years of playing around with development tools I think I can tolerate the downsides in order to enjoy the upsides.

This development system will be targeted for, and written in ocaml – because it’s the finest language I’ve encountered. A few years ago, I looked through the gcc sources and spend weeks trying to untangle how they worked. In contrast, when I recently tried to use the lexer, parser and typechecker from the ocaml compiler in a stand-alone test application, it took me about 20 minutes to get it working.

So that’s the plan anyway, in the vaguest sense of the word. It opens the door to all sorts of interesting problems.

How will I edit the code then? Well, to get started, I’m going to pretty-print the abstract representation of each module back into (you guessed it) ascii source code, and use good old emacs! That’s the quick and easy way to get going. But the crucial point is that it’s not the *only* way to edit code in this system. I will also be able to apply semantic transformations (like rename) direct to the abstract source tree. I can write my own display routines to show the code using whatever crazy display or edit method I can conceive of. The raw information is all there ready to be used.

Previously, when I thought about writing a development environment I’d spend ages wondering about how to keep maintain the “original sources”(ie. the ascii files) when the user was busy applying refactorings. Well, I fixed that problem by getting rid of the ascii files! I’ll import my legacy sources into the system once and from that day on, I’ll be working with vastly improved tools and I’ll forget all about files. If I want to distribute an ascii-source version to someone else, I’ll pretty-print one.

“It won’t inter-operate with people’s existing toolsets”, you say. “Noone will buy it”, you say. You’re totally right. But I don’t care. I’m writing this for me, because it’s a Much Better Way to write software. Then, uhh, I’m going to take over the world, uhh, or make tea, or something … 😉

(I had an discussion once with Michelle once where I said that if you believe software patents are wrong, but you’re in a situation where the company you are running is going to go bust unless you pursue some patents, then the Right Things To Do is to stick to your principles and have the company go bust. I don’t often have many views which are strongly one way or the other – most of the time I end up arguing both sides – but when I do get round to strongly believing in something, I’m fairly solid in my beliefs. It annoys me on a daily basis that we monkeys keep ourselves busy with inferior, inefficient coding tools when if we used something better, we could spend more time outside enjoying the sun)

So, onto the concrete software problems! I was trying to find elegant solutions to pretty-printing of source code. There’s lots of literature on how to parse source code – that is, going from the crummy ascii-based representations which we humans have traditionally favoured with all it’s implicit conventions, into a clean structured representation. Pretty-printing, the opposite problem, gets less attention .. I guess because fewer people ever need to do it.

There’s two sides to the pretty-printing problem. One is how to lay the source code out on the page – where to put line-breaks and whitespace. The second is how to reconstruct the “optional” presentation elements – like, putting parenthesis in expressions where needed, and even sometimes adding extra ones where they’re not absolutely needed if that will make it easier for me to read.

Fortunately, this is an area where we people have been showcasing the power and elegance of functional programming languages such as ocaml and Haskell. I found papers describing a pretty printing method and an “unparsing” method which not only describe concise elegant solutions, but also prove all sorts of neat and useful properties of the algorithms. Tasty.

With pretty-printing up and running, I can choose what size of “slice” I want to edit – a single function, a whole module, or maybe a method and all its callees. Then I edit this by hand, and run a parser on it, and then integrate it back into the abstract source tree. If I’ve pretty-printed using the standard ocaml syntax, I can just reuse the standard ocaml parser. I could make up my own syntax too, but then I’d have to write a parser as well as a pretty printer. (There’s a bit of overlap with camlp4 obviously).

Anyhow, for me this is a “burn the diskpacks” project. I am cherry-picking the relevant bits from lisp and smalltalk, mixing them together with ocaml and getting rid of historical baggage like ascii files where necessary. As I said earlier, the real breakthrough was deciding to appoint myself as the single target user. And also, being arrogant enough to believe that most development tools and languages today really do suck beyond belief. Heh, having said all that .. watch me loose interest in computers for the next six months! 😉

Hardware read/write breakpoints

Intel processors have four hardware breakpoints, which means you can drop down into the debugger when someone writes *or reads* from a memory address. This is pretty handy when you’re working with crufty languages like C++ and you have to track down memory corruption. Of course, you could use Purify/Boundschecker but here’s a cheaper and often faster solution.

Include the header hwbreak.h in your project, and create a HWBreak object either on the stack or heap (depending on how long you want it to live for). There’s a quick example in the comments. Now when the memory at that address is modified or read, you’ll drop down into the debugger.

(After I wrote this, I found another webpage which describes a similar approach, whose URL I’ve now lost. The difference is that I jump through hoops to ensure that I only change the thread context on a suspended thread, since the API docs warn against doing it on a running thread).

Now, DevStudio already gives you access to break-on-write breakpoints via the “data breakpoints” pane – you enter WO(0x12345678) to break if the WOrd at that address is modified. But there’s no way to get break-on-read breakpoints from within DevStudio. Well, in a vanilla DevStudio. But if you’re stubborn you can hack the binary to use read-write breakpoints instead of write-only ones. Details to follow …

Optimization

This is an idea which has been growing on me for a few years. I’ve finally manage to nail some of the concepts down a bit. I’m dreaming of a world where programs are elegant, concise and obviously correct. In particular, I’m asking:

“Why do programs have to get more complicated when you try to speed them up?”

If you want to speed up your program, you usually end up rewriting parts of it to make them go faster. Typically, you use better algorithms and data structures, add caching, or maybe rewrite an inner loop in assembler. The resulting code is probably more complicated, harder to understand, harder to modify and more likely to contain bugs. You’ve made a performance gain, but you’ve suffered a heavy loss in code quality.

It’s depressing to start with a clean elegant piece of code and then desecrate it in the name of performance. Aesthetics aside, optimization also makes code brittle. It’s harder to fix bugs in “optimized” code because it’s less clear what’s going on.

In my ideal world, I would be able to write an elegant, clear version of the code, and then seperately specify a transformation which can be applied to my elegant version to produce a faster version. This would allow me to express my intentions much better. I can say “look, this is what I’m really doing” and, seperately, I can say “here is how I make it run fast enough”.

If I were writing code to draw a straightline on a bitmap, my clean version would calculate the position of each pixel independently using “y = mx + c”. My “program transform” would describe the neat trick which Breshnam used in his famous algorithm, transforming line drawing into an incremental algorithm.

Web designers do this kind of thing all the time. It’s very uncool these days to write HTML pages which contain layout and formatting. The cool kids all now write their content in XML, and then produce a seperate “presentation transform” which is applied to make the version you actually look at. This allows the content and presentation to be controlled independently, rather than mixing the two together into a tangled mess.

It is not so easy to do this with code!

Do we not already have seperate optimizing transforms? Your compiler certainly knows about many potential speed-improving transformations, and it can apply them appropriatly to your code. Thus, your compiler will do its best to generate fast code, but it doesn’t give any guarantees.

Consequently, you may be tempted to apply these same kind of transformations manually to your source code to ensure that they really are getting done. For example, you may manually inline a function in your source code, because the “inline” keyword is technically only a suggestion to the compiler. After all, if inlining is critical to the performance of your application, you want to be absolutely sure it’s happening. You don’t want someone to upgrade the compiler on the build machine and suddenly find that your application crawls at a snails pace because the new compiler has decided not to inline your key function.

Manually inlining isn’t nice. It’s a bit like writing old-school HTML, resplendent with “font” tags and “color” attributes. The resulting code is a tangled mixture which is harder to understand.

Now, if you *knew* for sure that inline is going to be applied, you can make the source code clearer by splitting big lumps of code out into functions. You’re gaining code clarity by using functions, but you aren’t suffering any speed-loss because you *know* the functions are flattened out before code generation.

I sometimes think about this back-to-front. Imagine that you *know* that the compiler will honour an “inline” annotation before generating code (eg. some variation on __force_inline). In that case, it is safe for you to apply the opposite transformation (eg. splitting out into functions) to your source code. This means you don’t suffer a performance cost, but you gain a code-clarity points. If “inline” is an performance-boosting transformation, then the inverse of “inline” is a clarity-boosting transformation. What’s the inverse of “inline”? It’s the popular “extract method” refactoring, with all the well-known benefits it brings.

(As an aside, this makes you wonder what a “function” really is. Usually, a function is like a subroutine. You expect to see a CALL instruction, and a RET instruction somewhere. When you start applying transformations to your code, it becomes clear that functions are more like named scopes).

So, if we have a cast-iron guarantee from the compiler that it will honour “inline” we can simplify our code by applying the anti-transform to our source. We can utilise the clarity of functions without any of the cost.

Getting the required guarantee is the difficult bit today. Your language would really need annotations something like “compiler_must(inline)”. Actually, I guess that could be “optimizer_must(inline)” to underline the fact that “compiler” is really two black boxes – an program optimizer, and a code-generator. Either way, this is a much stronger assertion than merely having a comment in your source code which says “// I checked, and the compiler is definitely inlining this function today”.

Are compiler-based optimizations the whole story? No, of course not. A compiler has a very limited understanding of what your program does. It does not realise, for example, that you may be using matrices. It does not know that matrices have lots of interesting properties which could be exploited to make the code run faster. If you were just doing integer arithmetic, then the compiler could help. Integer arithmetic is part of the language, and consequently the compiler probably knows facts like “a * b = b * a”. But once you move away from core language features, the compiler is limited by a lack of domain-specific knowledge.

Clearly, we need some means of capturing relevant domain-specific information (such as “M * I = M, where I is the identity matrix) and then describing domain-specific optimizations which use these properties. Our optimizer in the compiler could then work at a much higher level. Furthermore, now that we have been freed from the boring work of producing optimal code, we can program at a very high level, leaving the task of producing optimal code to the optimizer. Hmm, we’ve arrived at the notion of DSL (domain specific languages) through a very performance-driven route. Normally I think about DSLs because they make programs more simple and easy to understand. I guess that’s just a reflection of the “optimizations are anti-clarity tranforms” idea.

I’ve wafted over the difficulties of actually writing one of these transforms. What language are you going to express it in? Returning to the HTML/XML example, there are a myriad of general purpose and domain specific languages for transforming XML. For low-level source code transforms, we would need to operate in a world of abstract syntax trees and dataflow analysis. For higher level transforms, we will be working in terms of domain concepts, like matrices.

Regardless of what language you use, you need to be very sure that the transformation is correct. It can contain bugs, just like any other program. There are probably many lines of code within your compiler dedicated to the “inline” transform, yet more for the “loop unrolling” transform. Imagine the effect which a bug in your compilers optimizer would have on your product. If we are going to write our own transformation too, they need to be totally bullet-proof.

Actually, that’s not a totally fair comparison. Compilers have to decided, based on limited information, whether it is safe to apply a particular transformation. If the preconditions are met, then the implementation of the transformation must work correctly in all possible circumstances. That means it has to assume the worst case scenario. For example, if your C++ code uses pointers at all then lots of optimizations aren’t possible – since writing through the pointer could conceivably change any of local/class/global variables in the program (the pointer aliasing problem). The compiler, by itself, doesn’t have any more information about how the pointer is actually used. Most compilers have an command-line option which is equivalent to “you can assume that pointers don’t alias”.

However, a transformation in your own code doesn’t have to be fully general. You have to take on the responsibility for ensuring that it’s valid for the cases which you need it to be. Going back to the HTML/XML example, you often find that people create stylesheets for fixed-position layouts, and they check that it look okay. Then they go back and add a new paragraph to the content. As a result, the paragraph overlaps one of the picture and the layout looks bad. Was the original stylesheet buggy? Not particularly, since it did the job fine before the new paragraph was added. It just wasn’t general enough to remain correct in the face of content changed.

Finally, we need to think about the effect which optimizers have on the rest of the development environment. Debuggers, coverage tools and profilers would all be effected by such a scheme. Stepping through compiler-optimized code in the debugger is often “exciting” as it is, and so adding a whole new layer of domain-specific optimizations will make things much worse.

As is traditional when I have a new idea, someone has already got there before me! It’s fun thinking through these things for a while before you read the literature. I wrote all of the above stuff, then went to the web and discovered Todd Veldhuizen’s web page. He does research on this very topic. I’ll have to forgive him for being so keen on C++ template metaprogramming (which is the subject of a whole seperate rant) because there’s loads of interesting information on his site. I was happy to see that my ramblings above bear some resemblence to the stuff he talks about.

[ If I offered a free pint of beer to anyone who made it through the whole article, would I get any takers? 😉 ]

C++ Refactoring Tools

Well, I tried to kickstart this year’s attempt to write a C++ refactoring tool by asking EDG if I could somehow get a cheap license for their C++ front end. In the past, I’ve got bogged down in the difficulty of parsing C++ and always gave up in disgust, vowing never to waste my time on such a primitive language. This time, I thought I could avoid doing the work myself and buy in a third party solution. Unfortunately, EDG either want a big up-front sum, or some concrete guarantee that I’ll sell lots of copies of the (expensive) resulting software. That’s understandable, since they need to make money, but it makes this approach unfeasible for me.

A few days later, I notice that the Xrefactory bods have apparently been taking the same route recently. Clever people!

Of course, having a parsed-and-semantically-analysed representation of a C++ program is only half the battle. In an ideal world, you’d throw Bill Opdyke’s thesis at the problem and you’d be done. But in the real world, you have a preprocessor to deal with. Since C and C++ have a preprocessor which works at the lexical level, it can perform horrendously arbitary transformations on the program. The preprocessor munges text, and has no awareness of the syntactic/semantic structures which it is manipulating.

Refactoring in the presence of the preprocessor is hard. Fortunately, some clever people have been worrying about this recently and published a pile of papers on the subject.

I shall read these papers soon, become disgusted by the inelegance of C++ and give up on my “C++ refactoring tool” project for the fifth time.