June 2020 – Andrew Birkett's blog

I’m not a medic, but I’ve worked with data, analysis and experiments for a while. This blog post is a “what’s going on” summary of hydroxychloroquine use in treating COVID-19.

Hydroxychloroquine (HCQ for short) is a medicine that helps treat a few different conditions, like malaria and arthritis. Maybe it’ll help COVID too? No one was sure, so many hospitals around the world started trying it and recording the outcome.

The question we want to answer is: does giving this medicine to COVID patients 1) help people, 2) harm people, or 3) do nothing at all? If it does have an effect, it might have a big effect or a small effect. The effect might be different in different patients, possibly due to age, genetics (including gender, race/ethnicity) and existing health conditions. We always start in a “we don’t know” state, and use data to end up at one of the three answers.

What’s happened recently is that we went from “don’t know” to “it harms people”, based on one research group’s analysis of some hospital data. But, it looks like their analysis might not be done to a high enough standard. So we’re shifting back towards “don’t know”. The fact that the evidence that “it harms” has gone away does NOT mean that “it helps”. Lack of evidence is not evidence of lack. It just takes us back to “we don’t know”.

So how do we try to answer the help/harms question? The ideal thing to do would be a “randomized controlled trial”, where we find a group of patients suffering from COVID and then randomly select half of them to receive HCQ. This approach gives you the best ability to measure the true effect that HCQ has. However, this is not what’s happened in this story. Randomized controlled trials are slow to set up – you usually need to find consenting volunteers. COVID is more like war-time – doctors are trying lots of things quickly. They are tracking the outcomes, but they’re basing their decision on whether to give HCQ on their experience and belief that it’s the best course of action, rather than the coin-flip of a randomized controlled trial. Everyone wants the certainty of a randomized controlled trial (and the authors of the controversial paper explicitly call for one). But all we have just now is “observational data” – a record of who got what and when, and what the outcome was.

So can we use the outcome data to answer our question? To get enough data to answer the question, we need access to data from more than one hospital. Hospitals are rightly careful about sharing patient data so this isn’t an easy task. Fortunately, some companies have put in the effort to get contracts signed with several hospitals around the world and so the human race can potentially benefit from insights that are made possible by having this data in one place. One such company is Surgisphere. Surgisphere (and their legal team) have got agreements with 671 hospitals around the world. This gives them access to data about individual patients – their age/gender/etc as well as medical conditions, treatments they’ve received and outcomes.

Surgisphere therefore have a very useful dataset. For now, let’s assume that they’ve managed to pull all this data together without making any systematic mistakes (for example, some countries measure a patient’s height in centimetres whereas others might use inches – would Surgisphere have noticed this?).

Within Surgisphere’s dataset, they had information about 96,032 patients who tested positive for COVID. Of those patients, it so happens that the various hospitals had chosen to give HCQ (or chloroquine) to 14,888 patients. The dataset doesn’t tell us specifically why those 14,888 got given HCQ – presumably the doctors thought it was their best option at the time based on the patient’s condition, age, weight etc.

Naively, you might expect that we could just compare the death rate in patients who were given HCQ with the death rate in patients who didn’t receive HCQ and see if it’s different.

Unfortunately, it’s not that simple. I’ll explain why shortly, but one key message here is “statistical data analysis isn’t simple, there’s a bunch of mistakes that are easy to make, even if you do this a lot”. Consequently, it’s important that people “show their working” by sharing their dataset and analysis so that others can check whether they’ve made any mistakes. If other people don’t have access to the same raw data, they can’t check for these easy-to-make mistakes – and lots of papers get published every year which end up being retracted because they made data analysis mistakes. Sharing raw data is hard in a medical setting – Surgisphere’s contracts with hospitals probably don’t allow them to share it. But without the raw data being shared and cross-checked by others, it’s reasonable to expect that any analysis has a good chance of having some flaws.

Why can’t we simply compare death rates? It’s because something like your age is a factor in both your chance of dying and whether you end up receiving HCQ from a doctor. Let’s assume for a moment that COVID is more deadly in elderly people (it is). Let’s also assume that doctors might decide that HCQ was the best treatment option for older people, but that younger people had some other better treatment option. In this scenario, even if HCQ has no effect, you’d expect the HCQ-treated patients to have a higher death rate than non-HCQ patients, simply due to their greater age. It is possible to try to fix this kind of mixup though – if we know patient ages, we can make sure we’re comparing (say) the group of 80 year olds who got HCQ against the group of 80 year olds who didn’t get HCQ. We’ll look at some of the difficulties in this approach shortly.
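To see the age mixup in action, here is a small made-up simulation. Every number in it (the share of older patients, how often doctors give HCQ, the death risks) is invented for illustration and is not taken from the Surgisphere paper. HCQ is deliberately given no effect at all, so any gap in death rates comes purely from who ended up in which group.

```python
import random


def simulate_patients(n=100_000, seed=0):
    """Return (age_group, got_hcq, died) tuples. HCQ has NO effect here."""
    rng = random.Random(seed)
    patients = []
    for _ in range(n):
        old = rng.random() < 0.5
        # Invented: doctors give HCQ to 60% of older and 10% of younger patients.
        got_hcq = rng.random() < (0.6 if old else 0.1)
        # Invented: death risk depends only on age, never on HCQ.
        died = rng.random() < (0.20 if old else 0.02)
        patients.append(("old" if old else "young", got_hcq, died))
    return patients


def death_rate(patients, got_hcq, age_group=None):
    """Death rate among patients with the given HCQ status (and age group, if set)."""
    group = [
        p for p in patients
        if p[1] == got_hcq and (age_group is None or p[0] == age_group)
    ]
    return sum(p[2] for p in group) / len(group)


patients = simulate_patients()

# Naive comparison: everyone who got HCQ versus everyone who didn't.
naive_gap = death_rate(patients, True) - death_rate(patients, False)

# Like-for-like comparison: same age group on both sides.
old_gap = death_rate(patients, True, "old") - death_rate(patients, False, "old")
young_gap = death_rate(patients, True, "young") - death_rate(patients, False, "young")

print(f"naive gap:        {naive_gap:+.3f}")
print(f"gap among old:    {old_gap:+.3f}")
print(f"gap among young:  {young_gap:+.3f}")
```

With these invented numbers the naive gap comes out at several percentage points against HCQ, while the gaps inside each age group are close to zero, which is the "true" answer we built into the simulation.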

The same reasoning applies for other patient factors like gender/race/ethnicity, existing health conditions etc. It also applies to other things that might influence patient outcome, such as what dose of HCQ was given, or how badly ill a patient was when they received HCQ. In an ideal world, we’d have data on all of these factors and we’d be able to adjust our analysis to take it all into account. But the more factors we try to take into account, the larger the dataset we need to do our analysis – otherwise we end up with just 1 or 2 patients in each ‘group’.
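A rough bit of arithmetic shows how quickly the groups shrink. The 14,888 figure is the number of HCQ patients quoted above; the list of factors and the number of categories for each are hypothetical, and the sketch assumes patients spread evenly across groups (real data would be lumpier, leaving many groups empty).

```python
from math import prod

HCQ_PATIENTS = 14_888  # HCQ/chloroquine patients quoted in the post

# Hypothetical factors and how many categories each is split into.
factors = {
    "age band": 8,
    "sex": 2,
    "ethnicity": 5,
    "diabetes": 2,
    "heart disease": 2,
    "smoker": 2,
    "severity at admission": 3,
    "BMI band": 4,
    "dose band": 3,
}


def average_group_size(n_patients, levels):
    """Average patients per group if they were spread evenly over all combinations."""
    return n_patients / prod(levels)


chosen = []
for name, n_levels in factors.items():
    chosen.append(n_levels)
    size = average_group_size(HCQ_PATIENTS, chosen)
    print(f"+ {name:<22} {prod(chosen):>6} groups, about {size:8.1f} patients each")
```

Splitting by age band alone leaves well over a thousand patients per group, but by the time all nine made-up factors are included there are more groups than patients.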

The whole dataset itself can easily be skewed. The hospitals which gave Surgisphere their data might all be expensive private hospitals with fancy equipment and good connections to whizzy American medical corporations, whereas hospitals in poorer areas might be too busy treating basic needs to worry about signing data sharing contracts. Private hospitals are more likely to be treating affluent people who suffer less from poverty-related illness. We can try to correct for known factors (like existing medical conditions) in our data analysis, but if the selection of hospitals itself was skewed then we’re starting the game with a deck stacked against us.

One worry is always that you can only adjust for factors that are mentioned in your dataset. For example, let’s suppose asthma makes COVID more deadly (I’m making this up as an example) but that our dataset did not provide details of patient asthma. It might be the case that patients with asthma all ended up in the HCQ group (could happen if some alternative treatment was available but known to be not-safe if you have asthma). But if our dataset doesn’t tell us about asthma, we just see that, overall, more HCQ patients died. We wouldn’t be able to see that this difference in death was actually due to a common underlying factor. We might WRONGLY go on to believe that the increased death rate was CAUSED by HCQ, when actually all that happened was higher-risk patients had systematically ended up in the HCQ group.
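The asthma story can be sketched the same way. Again, this is a toy simulation with invented numbers: asthma is assumed to raise the death risk and to make doctors much more likely (90% rather than the "all" in the story, so that both groups contain some asthma patients) to choose HCQ, while HCQ itself does nothing. The "analyst" only gets to see the `hcq` and `died` fields.

```python
import random

rng = random.Random(1)
records = []
for _ in range(200_000):
    asthma = rng.random() < 0.2                         # hidden from the analyst
    got_hcq = rng.random() < (0.9 if asthma else 0.2)   # invented prescribing habits
    died = rng.random() < (0.15 if asthma else 0.05)    # invented risks; HCQ plays no part
    records.append({"asthma": asthma, "hcq": got_hcq, "died": died})


def rate(rows):
    """Fraction of the given patients who died."""
    return sum(r["died"] for r in rows) / len(rows)


def gap(rows):
    """Death rate of HCQ patients minus death rate of non-HCQ patients."""
    hcq = [r for r in rows if r["hcq"]]
    no_hcq = [r for r in rows if not r["hcq"]]
    return rate(hcq) - rate(no_hcq)


# What the analyst sees: no asthma column, so everyone is lumped together.
visible_gap = gap(records)

# What we could see if asthma had been recorded.
gap_with_asthma = gap([r for r in records if r["asthma"]])
gap_without_asthma = gap([r for r in records if not r["asthma"]])

print(f"gap the analyst sees:       {visible_gap:+.3f}")
print(f"gap among asthma patients:  {gap_with_asthma:+.3f}")
print(f"gap among everyone else:    {gap_without_asthma:+.3f}")
```

The analyst sees HCQ patients dying noticeably more often and has no way, from the data alone, to tell that the gap would vanish if the missing column were available.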

Back to the story: our plan is to try to pair up each patient in the HCQ group with a “twin” in the non-HCQ group who has exactly the same age, weight, health conditions etc. Doing so allows us to tease apart the effect of age/weight/etc from the effect of getting given HCQ. But we almost certainly won’t find an “exact twin” for each HCQ patient – i.e. someone who matches on ALL characteristics. Instead, we typically try to identify a subset of non-HCQ patients who are similar in age/weight/etc to the group of patients who were given HCQ. (This is called “propensity score matching analysis”.)
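Here is a minimal sketch of the "find a twin" idea, using a handful of invented patients. Note that this is plain nearest-neighbour matching directly on age and weight, which is a simpler cousin of propensity score matching: a real propensity score analysis would first fit a model of how likely each patient was to receive HCQ and then match on that single score. The distance function and the `max_distance` cutoff are arbitrary choices made for illustration.

```python
def distance(a, b):
    """Crude dissimilarity: one year of age counts the same as one kg of weight."""
    return abs(a["age"] - b["age"]) + abs(a["weight_kg"] - b["weight_kg"])


def match_controls(treated, controls, max_distance=10):
    """Greedy 1:1 nearest-neighbour matching; each control is used at most once."""
    available = list(controls)
    pairs, unmatched = [], []
    for patient in treated:
        if not available:
            unmatched.append(patient)
            continue
        best = min(available, key=lambda c: distance(patient, c))
        if distance(patient, best) <= max_distance:
            pairs.append((patient, best))
            available.remove(best)
        else:
            # No sufficiently similar "twin" exists; this patient is dropped.
            unmatched.append(patient)
    return pairs, unmatched


# Invented patients, purely for illustration.
hcq = [
    {"id": "H1", "age": 80, "weight_kg": 70},
    {"id": "H2", "age": 45, "weight_kg": 90},
    {"id": "H3", "age": 95, "weight_kg": 50},
]
non_hcq = [
    {"id": "C1", "age": 44, "weight_kg": 88},
    {"id": "C2", "age": 79, "weight_kg": 72},
    {"id": "C3", "age": 30, "weight_kg": 60},
    {"id": "C4", "age": 81, "weight_kg": 95},
]

pairs, unmatched = match_controls(hcq, non_hcq)
for patient, twin in pairs:
    print(f"{patient['id']} matched with {twin['id']} (distance {distance(patient, twin)})")
print("no twin found for:", [p["id"] for p in unmatched])
```

Even in this tiny example one HCQ patient has no acceptable twin and falls out of the analysis, and the result depends on choices (how to measure "similar", how strict the cutoff is, what order patients are processed in) that a different researcher might make differently.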

The important word here is “try”. There’s usually not a good way to check whether you’ve done a good job here. I might do a rubbish job – perhaps the subset of non-HCQ patients I pick contains way more smokers than are in the HCQ group. We hope that our dataset contains all the important characteristics that allow us to make a genuinely representative set, but if it doesn’t then any comparisons we make between the HCQ group and our non-HCQ “twins” will not be telling us solely about the effect HCQ has. This is the fundamental problem with observational studies, and the only real solution is to do a randomized trial. (BTW, all of economics is based on observational data and suffers this problem throughout).

That’s enough stats details. The main point is that this kind of analysis is hard, and there’s a number of choices that the researcher has to make along the way which might be good or bad choices. The only way to check those choices is to have other experts look at the data.

This brings us to the objections that were raised against this initial publication. There are four kinds of objections raised:

1. The “we know it’s easy to make mistakes, and sharing data is the best way to catch mistakes” stuff (objection 2). There’s no implication of malicious intent here; Surgisphere need to honour their contracts. But the societal importance of understanding COVID is so high that we need to find ways to meet in the middle.
2. The “despite not releasing your raw data, there’s enough data in your paper that we can already spot data mistakes” stuff (objections 5, 6, 7, 8 and 9). Things like “the reported average HCQ dose is higher than the US max dose, and 66% of the data came from the US”. Or “your dataset says more people died in Australia from COVID than actually died”. It just doesn’t smell right. If you can spot two mistakes that easily, how many more are lingering in the data?
3. The “you skipped some of the basics” objections – no ethics review, no crediting of hospitals that gave data (objections 3 and 4).
4. The “you’ve not done the stats right” stuff (objections 1 and 10).

None of this means that the researchers were definitely wrong; it just means they might be wrong. It also doesn’t mean the researchers were malicious; countless papers are published every year which contain errors that are then picked up by peers. To me that’s a science success – it helps us learn what is true and false in the world. But it does mean that a single scientific paper that hasn’t been reproduced by other groups is “early stages” as far as gaining certainty goes.

The best way to know for sure what HCQ does to COVID patients is to run a controlled trial, and this had already started. But if you believe there’s evidence that HCQ causes harm, then ethically you would stop any trial immediately – and this is what happened (WHO trial and UK trial were both paused). But now the “evidence” of harm is perhaps not so strong, and so perhaps it makes sense to restart the controlled trials and learn more directly what the effect of HCQ on COVID patients actually is.