Performance impact of JAGS

In the previous two posts (here and here) I walked through an example of using JAGS directly to analyse a coin-toss experiment.

I’m interested to learn how the runtime of JAGS is affected by model choice and dataset size, and where the time is spent during evaluation. JAGS is open-source, and written in C++, so it’s quite easy to poke around the innards.

First, let’s do some high level black-box tests. We’ll take the coin-flip example from the previous post, and see how the wall-clock time on my rather old laptop increases as a) we increase the chain length, and b) as we increase the dataset size. My expectation is that both will be linear, since JAGS only uses a single core.

For ten coin flips, and 10/20/30 million steps it takes 3.4/6.9/10.4 seconds without monitors, and 4.5/8.8/12.9 seconds with a monitor. Plugging that into R shows a nice linear relationship, and we can get R to build a linear model for us to stop us having to think too hard:


> t < - read.table(stdin(), header=T)
0: steps time
1: 10000000 3.4
2: 20000000 6.9
3: 30000000 10.4

> lm( time ~ steps, t)
Coefficients:
(Intercept)        steps  
   -1.0e-01      3.5e-07  

In other words, each step takes 0.3 microseconds.

Similarly, if we stick to 10 million steps and no monitors, but increase the dataset size across 10/20/50/100, it takes 3.4/4.2/5.8/8.6 seconds which R also shows is linear albeit with a 3 second intercept.


> t < - read.table(stdin(), header=T)
0: datapoints time
1: 10 3.5
2: 20 4.2
3: 50 5.8
4: 100 8.6
5: 
> lm(time ~ datapoints, t)
Coefficients:
(Intercept)   datapoints  
    3.00408      0.05602  

So this mean that it takes 3 seconds to do a 10 million step walk, and although adding more datapoints makes each step more expensive, it’s only a little bit more expensive – 10 datapoints being about 0.5 seconds more than 1 datapoint. However, if we desired to go to “big data” with, say 10 million data points, we’d be talking about half a million seconds – ie. 11 days. So let’s hope we don’t need 10 million steps on a 10 million point dataset!

Next thing on the list is to understand where all that time is going. For this we can use the lovely perf tools which were added to linux in 2.6:


$ perf record jags example.jags 
Welcome to JAGS 4.2.0 on Sun Feb 26 16:00:26 2017
...
Initializing model
Updating 10000000
[ perf record: Woken up 2 times to write data ]
[ perf record: Captured and wrote 0.578 MB perf.data (14737 samples) ]

$ perf report
Overhead  Command        Shared Object        Symbol                                                                                                                           
  33.32%  jags-terminal  libm-2.19.so         [.] __ieee754_log_avx
  19.87%  jags-terminal  basemod.so           [.] _ZN4jags4base15WichmannHillRNG7uniformEv
  11.90%  jags-terminal  libjrmath.so.0.0.0   [.] jags_rbeta
  10.08%  jags-terminal  libm-2.19.so         [.] __ieee754_exp_avx

So this shows that the vast majority of time is being spent calculating logs and exponentials. Wichmann Hill is a pseudo-random uniform number generator. But why would you need exp/log to generate bernoulli distribution using a uniform prior?

Let’s use a debugger to see why it’s calling the log function ..


$ jags -d gdb
(gdb) b __ieee754_log_avx
Breakpoint 1 (__ieee754_log_avx) pending.
(gdb) r example.jags
Starting program: /home/adb/tmp/jags/libexec/jags-terminal example.jags
Welcome to JAGS 4.2.0 on Sun Feb 26 16:14:24 2017
...
Initializing model
Updating 10000000

Breakpoint 1, __ieee754_log_avx (x=16.608779218128113) at ../sysdeps/ieee754/dbl-64/e_log.c:57
57	../sysdeps/ieee754/dbl-64/e_log.c: No such file or directory.
(gdb) bt
#0  __ieee754_log_avx (x=16.608779218128113) at ../sysdeps/ieee754/dbl-64/e_log.c:57
#1  0x00007ffff66abce9 in jags_rbeta (aa=1, bb=0.5, rng=0x63b700) at rbeta.c:102
#2  0x00007ffff690e42e in jags::bugs::ConjugateBeta::update (this=0x63c200, chain=0, rng=0x63b700) at ConjugateBeta.cc:157
#3  0x00007ffff7b8b464 in jags::ImmutableSampler::update (this=0x63c170, rngs=std::vector of length 1, capacity 1 = {...}) at ImmutableSampler.cc:28

Our uniform prior is equivalent to a beta(1,1) prior, and since beta and bernoulli distributions are conjugate, our posterior will be a beta distribution. For Gibbs sampling, each “jump” is a draw from a single parameter conditional distribution – and since we only have one parameter theta, each “jump” sees us draw from a beta distribution.

Of course, we could’ve used this fact to calculate the posterior distribution algebraically and avoid all of this monkeying about with MCMC. But the purpose was to explore the performance of the JAGS implementation rather than solve a coin-toss problem per-se.

In the next article, I’ll look at the performance cost of switching to other distributions, such as normal and lognormal.