No BUGS, instead JAGS – Andrew Birkett's blog

JAGS is a useful statistics tool, helping you decide how to generalise experimental results. For example, if you roll a die ten times and it comes up “six” half of the time, is that strong evidence that the die is loaded? Or if you toss a coin ten times and it comes up heads nine times, what the probability that the coin is a “normal” coin?

JAGS is based on the bayesian approach to statistics, which uses Bayes rule to go from your experimental results to a (probabilistic) statement of how loaded the dice is. This is different approach from the frequencist approach to statistics which most textbooks cover – with p values and null hypothesis test. The upside of the bayesian approach is that it answers the kind of questions you want to ask (like, “what is the probability that drug A is better than drug B at treating asthma”) as opposed to the convoluted questions which frequencist statistics answer (“assuming that there’s no difference between drug A and drug B, what’s the probability that you’d get a measured difference at least as large as the one you saw?”). The downside of Bayesian stats is that you have to provide a “prior” probability distribution, which expresses your beliefs of how likely each outcome is prior to seeing any experiment results. That can seem a bit ugly, since it introduces a subjective element to the calculation. Some people find that unacceptable, and indeed that’s why statistics forked in the early 1900s from its bayesian origins to spawn the frequencist school of probability with it’s p-values driven by popular works by Ronald Fisher. But on the other hand, no experiment is run in a vacuum. We do not start each experiment in complete ignorance, nor is our selection of which experiments to run, or which hypotheses to check, determined objectively. The prior allows us to express information from previous knowledge, and we can rerun our analysis with a range of priors to see how sensitive our results are to the choice of prior.

Although Bayes rule is quite simple, only simpler textbook examples can be calculated exactly using algebra. This does include a few useful cases, like the coin-flipping example used earlier (so long as your prior comes from a particular family of probability distribution). But for more real-world examples, we end up using numerical techniques – in particular, “Markov Chain Monte Carlo” methods. “Monte Carlo” methods are anything where you do simple random calculations which, when repeated enough times, converge to the right answer. A nice example is throwing darts towards a circular darts board mounted on a square piece of wood – if the darts land with uniform probability across the square, you can count what fraction land inside the circle and from that get an approximation of Pi. As you throw more and more darts, the approximation gets closer and closer to the right answer. “Markov Chain” is the name given to any approach where the next step in your calculation only depends on the previous step, but not any further back in history. In Snakes and Ladders, your next position depends only on your current position and the roll of the die – it’s irrelevant where you’ve been before that.

When using MCMC methods for Bayesian statistics, we provide our prior (a probability distribution) and some data and a choice of model with some unknown parameters, and the task is to produce probability distributions for these unknown parameters. So our model for a coin toss might be a Bernoilli distribution with unknown proportion theta, our prior might be a uniform distribution from 0 to 1 (saying that we think all values of theta are equally likely) and the data would be a series of 0’s or 1’s (corresponding to heads and tails). We run our MCMC algorithm of choice, and out will pop a probability distribution over possible values of theta (which we call the ‘posterior’ distribution). If our data was equally split between 0’s and 1’s, then the posterior distribution would say that theta=0.5 was pretty likely, theta=0.4 or theta=0.6 fairly likely and theta=0.1 or theta=0.9 much less likely.

There’s several MCMC methods which can be used here. Metropolis-Hasting, created in 1953 during the Teller’s hydrogen bomb project, works by jumping randomly around the parameter space – always happy to jump towards higher probability regions, but will only lump to lower probability regions some of the time. This “skipping around” yields a sequence (or “chain”) of values for theta drawn from the posterior probability distribution. So we don’t ever directly get told what the posterior distribution is, exactly, but we can draw arbitrarily many values from it in order to answer our real-world question to sufficient degree of accuracy.

JAGS uses a slightly smarter technique called Gibbs Sampling which can be faster because, unlike Metropolis-Hasting, it never skips/rejects any of the jumps. Hence the name JAGS – Just Another Gibbs Sampler. You can only use this if it’s easy to calculate the conditional posterior distribution, which is often the case. But it also frees you from the Metropolis-Hasting need to have (and tune) a “proposal” distribution to choose potential jumps.

In the next post, I’ll cover pragmatics of running JAGS on a simple example, then look at the performance characteristics.