The cost of generality?

The nice thing about BUGS/JAGS/Stan/etc is that they can operate on arbitrarily complex bayesian networks. You can take my running ‘coin toss’ example and add extra layers. Imagine that we believe that the mint who made the coin produces coins who bias ranges between theta=0.7 and theta=0.9 uniformly. Now we can take data about coin tosses, and use it to infer not only knowledge about the bias of one coin, but also about the coins made by the mint.

But this kind of generality comes at a cost. Let’s look at a simpler model: we have ten datapoints, drawn from a normal distribution with mean mu and standard deviation sigma, and we start with uniform priors over mu and sigma.

For particular values of mu and sigma, the posterior density is proportional to the likelihood, which is a product of gaussians. However, we can avoid doing a naive N exponentials with a bit of algebra, instead doing a single exponential involving a summation. So, as we add more data points, the runtime cost of evaluating the posterior (or at least something proportional to it) will rise, but only at the cost of a few subtractions/squares/divides rather than more exponentials.

In contrast, when I use JAGS to evaluate 20 datapoints, it does twice as many log() calls as it does for 10 datapoints, so seems not to be leveraging any algebraic simplifications.

Next step: write a proof of concept MCMC sampler which runs faster than JAGS for the non-hierarchical cases which are most useful to me.

JAGS: normal and not-normal performance

Previously, we’ve walked through a coin-tossing example in JAGS and looked at the runtime performance. In this episode, we’ll look at the cost of different distributions.

Previously, we’ve used a uniform prior distribution. Let’s baseline on 100,000 step chain, with 100 datapoints. For uniform prior, JAGS takes a mere 0.2secs. But change to a normal prior, such as dnorm(0.5,0.1), and JAGS takes 3.3sec – with __ieee754_log_avx called from DBern::logDensity taking up 80% of the CPU time, according to perf:


  70.90%  jags-terminal  libm-2.19.so         [.] __ieee754_log_avx
   9.27%  jags-terminal  libjags.so.4.0.2     [.] _ZNK4jags20ScalarStochasticNode10logDensityEjNS_7PDFTypeE
   5.09%  jags-terminal  bugs.so              [.] _ZNK4jags4bugs5DBern10logDensityEdNS_7PDFTypeERKSt6vectorIPKdSaIS5_EES5_S5_

If we go from bernoulli data with normal prior, to normal data with normal priors on mean/sd, it gets more expensive again – 4.8 seconds instead of 3.3 – as the conditional posterior gets more complex. But still it’s all about the logarithms.

Logarithms aren’t straightforward to calculate. Computers usually try to do a fast table lookup, falling back to a series-expansion approach if that doesn’t work. On linux, the implementation comes as part of the GNU libc, with the name suggesting that it uses AVX instructions if your CPU is modern enough (there’s not a “logarithm” machine code instruction, but you can use AVX/SSE/etc to help with your logarithm implementation).

Notably, JAGS is only using a single core throughout all of this. If we wanted to compute multiple chains (eg. to check convergence) then the simple approach of running each chain in a separate JAGS process works fine – which is what the jags.parfit R package does. But could you leverage the SIMD nature of SSE/AVX instruction to run several chains in parallel? To be honest, the last time I was at this low level, SSE was just replacing MMX! But since AVX seems to have registers which hold eight 32-bit floats perhaps there’s the potential to do 8 chains in blazing data-parallel fashion?

(Or alternatively, how many people in the world care both about bayesian statistics and assembly-level optimization?!)

Just for reference, on my laptop a simple “double x=0; for (int i=0; i<1e8; i++) x = log(x);” loop takes 6.5 seconds, with 75% being in __ieee754_log_avx – meaning each log() is taking 48ns.

To complete the cycle, let’s go back to JAGS with a simple uniform prior and bernoulli likelihood, only do ten updates, with one datapoints and see how many times ‘log’ is called. For this, we can use ‘ltrace’ to trace calls to shared objects like log():


$ ltrace -xlog   $JAGS/jags/libexec/jags-terminal  example.jags  2>&1 | grep -c log@libm.so

Rather surprisingly, the answer is not stable! I’ve seen anything from 20 to 32 calls to log() even though the model/data isn’t changing (but the random seed presumably is). Does that line up with the 3.4 seconds to do 10 million steps @ 10 data points, if log() takes 48ns? If we assume 2 calls to log(), then 10e6 * 10 * 2 * 48e-9 = 9.6secs. So, about 2.8x out but fairly close.

Next step is to read through the JAGS code to understand the Gibbs sampler in detail. I’ve already read through the two parsers and some of the Graph stuff, but want to complete my understanding of the performance characteristics.

Performance impact of JAGS

In the previous two posts (here and here) I walked through an example of using JAGS directly to analyse a coin-toss experiment.

I’m interested to learn how the runtime of JAGS is affected by model choice and dataset size, and where the time is spent during evaluation. JAGS is open-source, and written in C++, so it’s quite easy to poke around the innards.

First, let’s do some high level black-box tests. We’ll take the coin-flip example from the previous post, and see how the wall-clock time on my rather old laptop increases as a) we increase the chain length, and b) as we increase the dataset size. My expectation is that both will be linear, since JAGS only uses a single core.

For ten coin flips, and 10/20/30 million steps it takes 3.4/6.9/10.4 seconds without monitors, and 4.5/8.8/12.9 seconds with a monitor. Plugging that into R shows a nice linear relationship, and we can get R to build a linear model for us to stop us having to think too hard:


> t < - read.table(stdin(), header=T)
0: steps time
1: 10000000 3.4
2: 20000000 6.9
3: 30000000 10.4

> lm( time ~ steps, t)
Coefficients:
(Intercept)        steps  
   -1.0e-01      3.5e-07  

In other words, each step takes 0.3 microseconds.

Similarly, if we stick to 10 million steps and no monitors, but increase the dataset size across 10/20/50/100, it takes 3.4/4.2/5.8/8.6 seconds which R also shows is linear albeit with a 3 second intercept.


> t < - read.table(stdin(), header=T)
0: datapoints time
1: 10 3.5
2: 20 4.2
3: 50 5.8
4: 100 8.6
5: 
> lm(time ~ datapoints, t)
Coefficients:
(Intercept)   datapoints  
    3.00408      0.05602  

So this mean that it takes 3 seconds to do a 10 million step walk, and although adding more datapoints makes each step more expensive, it’s only a little bit more expensive – 10 datapoints being about 0.5 seconds more than 1 datapoint. However, if we desired to go to “big data” with, say 10 million data points, we’d be talking about half a million seconds – ie. 11 days. So let’s hope we don’t need 10 million steps on a 10 million point dataset!

Next thing on the list is to understand where all that time is going. For this we can use the lovely perf tools which were added to linux in 2.6:


$ perf record jags example.jags 
Welcome to JAGS 4.2.0 on Sun Feb 26 16:00:26 2017
...
Initializing model
Updating 10000000
[ perf record: Woken up 2 times to write data ]
[ perf record: Captured and wrote 0.578 MB perf.data (14737 samples) ]

$ perf report
Overhead  Command        Shared Object        Symbol                                                                                                                           
  33.32%  jags-terminal  libm-2.19.so         [.] __ieee754_log_avx
  19.87%  jags-terminal  basemod.so           [.] _ZN4jags4base15WichmannHillRNG7uniformEv
  11.90%  jags-terminal  libjrmath.so.0.0.0   [.] jags_rbeta
  10.08%  jags-terminal  libm-2.19.so         [.] __ieee754_exp_avx

So this shows that the vast majority of time is being spent calculating logs and exponentials. Wichmann Hill is a pseudo-random uniform number generator. But why would you need exp/log to generate bernoulli distribution using a uniform prior?

Let’s use a debugger to see why it’s calling the log function ..


$ jags -d gdb
(gdb) b __ieee754_log_avx
Breakpoint 1 (__ieee754_log_avx) pending.
(gdb) r example.jags
Starting program: /home/adb/tmp/jags/libexec/jags-terminal example.jags
Welcome to JAGS 4.2.0 on Sun Feb 26 16:14:24 2017
...
Initializing model
Updating 10000000

Breakpoint 1, __ieee754_log_avx (x=16.608779218128113) at ../sysdeps/ieee754/dbl-64/e_log.c:57
57	../sysdeps/ieee754/dbl-64/e_log.c: No such file or directory.
(gdb) bt
#0  __ieee754_log_avx (x=16.608779218128113) at ../sysdeps/ieee754/dbl-64/e_log.c:57
#1  0x00007ffff66abce9 in jags_rbeta (aa=1, bb=0.5, rng=0x63b700) at rbeta.c:102
#2  0x00007ffff690e42e in jags::bugs::ConjugateBeta::update (this=0x63c200, chain=0, rng=0x63b700) at ConjugateBeta.cc:157
#3  0x00007ffff7b8b464 in jags::ImmutableSampler::update (this=0x63c170, rngs=std::vector of length 1, capacity 1 = {...}) at ImmutableSampler.cc:28

Our uniform prior is equivalent to a beta(1,1) prior, and since beta and bernoulli distributions are conjugate, our posterior will be a beta distribution. For Gibbs sampling, each “jump” is a draw from a single parameter conditional distribution – and since we only have one parameter theta, each “jump” sees us draw from a beta distribution.

Of course, we could’ve used this fact to calculate the posterior distribution algebraically and avoid all of this monkeying about with MCMC. But the purpose was to explore the performance of the JAGS implementation rather than solve a coin-toss problem per-se.

In the next article, I’ll look at the performance cost of switching to other distributions, such as normal and lognormal.

JAGS, and a bayesian coin toss

In the previous post, I talked about Bayesian stats and MCMC methods in general. In this post, I’ll work through an example where we try to infer how fair a coin-toss is, based on the results of ten coin flips. Most people use JAGS via an R interface, but I’m going to use JAGS directly to avoid obfuscation.

(Note: a coin-toss is a physical event determined by physics, so the “randomness” arises only through uncertainty of how hard it’s tossed, how fast it spins, where it lands etc, and therefore is open to all sorts of evil)

Firstly, we have to tell JAGS about our problem – eg. how many coin tosses we’ll do, and that we believe each coin toss is effectively a draw from a Bernoulli distribution with unknown proportion theta, and what our prior beliefs about theta are.

To do this, we create “example.model” containing:


model {
  for (i in 1:N){
    x[i] ~ dbern(theta)
  }
  theta ~ dunif(0,1)
}

This says that we’ll have N coin-flips, and each coin flip is assumed to be drawn from the same Bernoulli distribution with unknown proportion theta. We also express our prior belief that all values of theta from zero to one are equally likely.

We can now launch “jags” in interactive mode:


$ jags
Welcome to JAGS 4.2.0 on Sun Feb 26 14:31:57 2017
JAGS is free software and comes with ABSOLUTELY NO WARRANTY
Loading module: basemod: ok
Loading module: bugs: ok

.. and tell it to load our example.model file ..


. model in example.model

If the file doesn’t exist, or the model is syntactically invalid you’ll get an error – silence means everything has gone fine.

Next, we need the data about the coin flip, which corresponds to the x[1] .. x[N] in our model. We create a file called “example.data” containing:


N < - 10
x <- c(0,1,0,1,1,1,0,1,0,0)

The format for this file matches what R’s dump() function spits out. Here we’re saying that we have flipped ten coins (N is 10) and the results were tails/heads/tails/heads/heads etc. I’ve chosen the data so we have the same number of heads and tail, suggesting a fair coin.

We tell JAGS to load this file as data:


. data in example.data
Reading data file example.data

Again, it’ll complain about syntax errors (in an old-school bison parser kinda way) or if you have duplicate bindings. But it won’t complain yet if you set N to 11 but only provided 10 data points.

Next, we tell JAGS to compile everything. This combines your model and your data into an internal graph structure, ready for evaluating. It’s also where JAGS will notice if you’ve got too few data points or any unbound names in your model.


. compile
Reading data file example.data
. compile
Compiling model graph
   Resolving undeclared variables
   Allocating nodes
Graph information:
   Observed stochastic nodes: 10
   Unobserved stochastic nodes: 1
   Total graph size: 14

The graph consists of ten “observed” nodes (one per coin flip) and one unobserved stochastic node (the unknown value of theta). The other nodes presumably include the bernoulli distribution and the uniform prior distribution.

At this stage, we can tell JAGS where it should start its random walk by providing an initial value for theta. To do this, we create a file “example.inits” containing:


theta < - 0.5

.. and tell JAGS about it ..


. parameters in example.inits
Reading parameter file example.inits

Finally, we tell JAGS to initialize everything so we’re ready for our MCMC walk:


. initialize
Initializing model

Now we’re ready to start walking. We need to be a bit careful at first, because we have to choose a starting point for our random walk (we chose theta=0.5) and if that’s not a good choice (ie. it corresponds to a low posterior probability) then it will take a while for the random walk to dig itself out of the metaphorical hole we dropped it in. So, we do a few thousand steps of our random walk, give it a fancy name like “burn-in period” and cross our fingers that our burn-in period was long enough:


. update 4000
Updating 4000
-------------------------------------------------| 4000
************************************************** 100%

(JAGS give some enterprise-level progress bars when in interactive mode, but not in batch mode).

JAGS has happily done 4000 steps in our random walk, but it hasn’t been keeping track of anything. We want to know what values of theta is jumping between, since that sequence (aka “chain”) of values is what we want as output.

To tell JAGS to start tracking where it’s been, we create a sampler for our ‘theta’ variable, before proceeding for another 4000 steps, and then writing the results out to a file:


. monitor theta
. update 4000
-------------------------------------------------| 4000
************************************************** 100%
. coda *

The last command causes two files to be written out – CODAindex.txt and CODAchain1.txt. CODA is a hilariously simple file format, coming originally from the “Convergence Diagnostic and Output Analysis” package in R/S-plus. Each line contains a step number (eg. 4000) and the value of theta at that step (eg. 0.65).

Here’s an interesting thing – why would we need a “Convergence Diagnostic” tool? When we did our “burn-in” phase we crossed our fingers and hoped we’d ran it for long enough. Similarly, when we did the random walk we also used 4000 steps. Is 4000 enough? Too many? We can answer these questions by looking at the results of the random walk – both to get the answer to our original question, but also to gain confidence that our monte-carlo approximation has thrown enough darts to be accurate.

At this point, we’ll take our coda files and load them into R to visualize the results.

$ R
R version 3.0.2 (2013-09-25) -- "Frisbee Sailing"
Copyright (C) 2013 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

> require(coda)
Loading required package: coda

> c < - read.coda(index.file="CODAindex.txt",output.file="CODAchain1.txt")
Abstracting theta ... 5000 valid values

> summary(c)
Iterations = 4001:9000
Thinning interval = 1 
Number of chains = 1 
Sample size per chain = 5000 

1. Empirical mean and standard deviation for each variable,
   plus standard error of the mean:

          Mean             SD       Naive SE Time-series SE 
      0.501658       0.139819       0.001977       0.001977 

2. Quantiles for each variable:

  2.5%    25%    50%    75%  97.5% 
0.2436 0.4000 0.5022 0.6017 0.7675 

This is telling us that, given ten coin flips and our prior uniform belief and our bernoulli assumption, the most probably value for theta (the proportion of coin-flip yielding heads) is close to 0.5. Half of the probability mass lies between theta=0.4 and theta=0.6, and 95% of the probability mass lies between theta=0.25 and theta=0.75.

So it’s highly unlikely that the coin flip is extremely biased – ie. theta<0.25 or theta>0.75. Pleasantly, “highly unlikely” means “probability is less than 5%”. That’s a real common-or-garden probability. Not any kind of frequencist null-hypothesis p-value. We can make lots of other statements too – the probability that the bias is greater than 0.75 is about 40%. If we had a second coin (or coin flipper) we could make statement like “the probability that coin2 has a higher bias than coin1 is xx%”.

Let’s briefly revisit the question of convergence. There’s a few ways to determine how well your random walk represents (or “has converged to”) the true posterior distribution. One way, by Rubin and Gelman, is to run several random walks and look at the variance between them. The coda package in R comes with a function gelman.diag() for this purpose. However, in our simple example we only did one chain so we can’t run it on our coda files. (Incidentally, Gelman writes a great blog about stats).

In the next post, I’m will look at the performance characteristics of JAGS – how it scales with the number of data points, and what tools you can use to track this.

No BUGS, instead JAGS

JAGS is a useful statistics tool, helping you decide how to generalise experimental results. For example, if you roll a die ten times and it comes up “six” half of the time, is that strong evidence that the die is loaded? Or if you toss a coin ten times and it comes up heads nine times, what the probability that the coin is a “normal” coin?

JAGS is based on the bayesian approach to statistics, which uses Bayes rule to go from your experimental results to a (probabilistic) statement of how loaded the dice is. This is different approach from the frequencist approach to statistics which most textbooks cover – with p values and null hypothesis test. The upside of the bayesian approach is that it answers the kind of questions you want to ask (like, “what is the probability that drug A is better than drug B at treating asthma”) as opposed to the convoluted questions which frequencist statistics answer (“assuming that there’s no difference between drug A and drug B, what’s the probability that you’d get a measured difference at least as large as the one you saw?”). The downside of Bayesian stats is that you have to provide a “prior” probability distribution, which expresses your beliefs of how likely each outcome is prior to seeing any experiment results. That can seem a bit ugly, since it introduces a subjective element to the calculation. Some people find that unacceptable, and indeed that’s why statistics forked in the early 1900s from its bayesian origins to spawn the frequencist school of probability with it’s p-values driven by popular works by Ronald Fisher. But on the other hand, no experiment is run in a vacuum. We do not start each experiment in complete ignorance, nor is our selection of which experiments to run, or which hypotheses to check, determined objectively. The prior allows us to express information from previous knowledge, and we can rerun our analysis with a range of priors to see how sensitive our results are to the choice of prior.

Although Bayes rule is quite simple, only simpler textbook examples can be calculated exactly using algebra. This does include a few useful cases, like the coin-flipping example used earlier (so long as your prior comes from a particular family of probability distribution). But for more real-world examples, we end up using numerical techniques – in particular, “Markov Chain Monte Carlo” methods. “Monte Carlo” methods are anything where you do simple random calculations which, when repeated enough times, converge to the right answer. A nice example is throwing darts towards a circular darts board mounted on a square piece of wood – if the darts land with uniform probability across the square, you can count what fraction land inside the circle and from that get an approximation of Pi. As you throw more and more darts, the approximation gets closer and closer to the right answer. “Markov Chain” is the name given to any approach where the next step in your calculation only depends on the previous step, but not any further back in history. In Snakes and Ladders, your next position depends only on your current position and the roll of the die – it’s irrelevant where you’ve been before that.

When using MCMC methods for Bayesian statistics, we provide our prior (a probability distribution) and some data and a choice of model with some unknown parameters, and the task is to produce probability distributions for these unknown parameters. So our model for a coin toss might be a Bernoilli distribution with unknown proportion theta, our prior might be a uniform distribution from 0 to 1 (saying that we think all values of theta are equally likely) and the data would be a series of 0’s or 1’s (corresponding to heads and tails). We run our MCMC algorithm of choice, and out will pop a probability distribution over possible values of theta (which we call the ‘posterior’ distribution). If our data was equally split between 0’s and 1’s, then the posterior distribution would say that theta=0.5 was pretty likely, theta=0.4 or theta=0.6 fairly likely and theta=0.1 or theta=0.9 much less likely.

There’s several MCMC methods which can be used here. Metropolis-Hasting, created in 1953 during the Teller’s hydrogen bomb project, works by jumping randomly around the parameter space – always happy to jump towards higher probability regions, but will only lump to lower probability regions some of the time. This “skipping around” yields a sequence (or “chain”) of values for theta drawn from the posterior probability distribution. So we don’t ever directly get told what the posterior distribution is, exactly, but we can draw arbitrarily many values from it in order to answer our real-world question to sufficient degree of accuracy.

JAGS uses a slightly smarter technique called Gibbs Sampling which can be faster because, unlike Metropolis-Hasting, it never skips/rejects any of the jumps. Hence the name JAGS – Just Another Gibbs Sampler. You can only use this if it’s easy to calculate the conditional posterior distribution, which is often the case. But it also frees you from the Metropolis-Hasting need to have (and tune) a “proposal” distribution to choose potential jumps.

In the next post, I’ll cover pragmatics of running JAGS on a simple example, then look at the performance characteristics.