An Exploration of Bayesian A/B Testing
So I’ve recently been looking into using a Bayesian interpretation of A/B test results for work, and have found the discussion so interesting that I’ve decided to write about it! In particular, I’m fascinated by how much disagreement there seems to be among data scientists and statisticians about the validity of early stopping in Bayesian testing – this made me want to do a little testing of my own. If you also have feelings/knowledge about Bayesian statistics, I’d love to discuss this topic with you, but I didn’t feel like figuring out how to enable comments on this blog – please send me an email/FB message/LinkedIn message. You can find the code I used for this post here.
I wasn’t able to do everything I wanted to in this one post (part 2 coming soon), but my takeaway so far is that Bayesian testing is a good alternative to traditional hypothesis tests for A/B testing when you have some prior knowledge of the test metric, and especially when you have unevenly split groups (i.e. 90/10 rather than 50/50). However, the assertion that “early peeking is fine” reflects a philosophical difference more than an actual increase in statistical robustness, and should be taken with a grain of salt. In some ways, using Bayes testing (especially if you’re going to early-stop) trades away rigor for impact. However, I still think it delivers trustworthy and useful results, and am planning to use it as my preferred A/B testing framework.
First, some background
A quick refresher on A/B testing
Probably the most common experiment companies perform, A/B testing is an experimental setup that weighs two options against each other. In the most basic version, a sample of users is split in half, with one half getting version A and the other getting version B – for example, a website could test the performance of a blue button vs. an orange one. It would route a random half of its visitors to a page with a blue button, and the other half to a page with an orange button. After a while, it would do some analysis to see if there’s a statistically significant difference in how often each button was clicked.
What does “Bayesian” mean in this context anyway?
Bayesian inference is a type of statistical inference we can use to draw conclusions from data. The hallmark of Bayesian inference is that it uses a subjective perspective on the problem as a starting point – that is, the Bayesian approach uses a “prior” distribution that represents your existing belief about the distribution of a random variable, and then uses observed data to update that knowledge and build a “posterior” distribution.
Bayesian A/B testing refers to using a Bayesian approach to interpret the result of an A/B test. I think the best way to explain this is to reuse the simple orange vs. blue button example from the previous section. Before the A/B test starts, you identify your prior beliefs: maybe you had a black button on your site already, and each day between 25% and 50% of visitors clicked the button, so you say that the probability of a button being clicked on your site is somewhere between 25% and 50%. You can use that to build a “prior” distribution for the probability of any button being clicked. Then, as the A/B test starts, you have two groups: orange and blue. The observations of how many people click each color button are combined with your prior distribution, and you end up with two “posterior” distributions that represent the probability of each button being clicked. You can then compare your orange distribution to your blue distribution and say things like “there’s a 95% chance that the blue button is more likely to be clicked than the orange button”.
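To make that concrete, here’s a tiny sketch of the math in Python. The prior and the click counts are made up for illustration; I’m using a Beta prior because it plays nicely with yes/no click data (the posterior is also a Beta, so the “update” is just adding observed successes and failures to the prior’s parameters):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical prior: click-through rate somewhere between ~25% and ~50%.
# A Beta(15, 25) distribution puts most of its mass roughly in that range.
prior_alpha, prior_beta = 15, 25

# Made-up observations from the A/B test
blue_clicks, blue_visitors = 180, 500
orange_clicks, orange_visitors = 150, 500

# Conjugate update: add successes and failures to the prior parameters
blue_post = stats.beta(prior_alpha + blue_clicks,
                       prior_beta + blue_visitors - blue_clicks)
orange_post = stats.beta(prior_alpha + orange_clicks,
                         prior_beta + orange_visitors - orange_clicks)

# Estimate P(blue > orange) by sampling from both posteriors
blue_draws = blue_post.rvs(100_000, random_state=rng)
orange_draws = orange_post.rvs(100_000, random_state=rng)
print("P(blue > orange) ≈", (blue_draws > orange_draws).mean())
```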
How does this differ from hypothesis testing?
Frequentist hypothesis testing is what many of us learned in high school: dig real deep and try to remember writing things like “p < 0.05 so we reject the null hypothesis with 95% confidence”, or something like that. Frequentist inference operates under a totally different philosophy from Bayesian inference; where Bayesian inference relies on a subjective set of beliefs (i.e., different people can come up with different priors and end up with different results), frequentist inference is objective (the same observations will always give you the same answers). Frequentist inference says “our observations are all we have to rely on, and they represent only a sample of the infinite possible draws of the data (e.g. flips of a coin)”. It doesn’t give you a probability distribution for the quantity you actually care about. Instead, it asks something more like “if we repeated this experiment an infinite number of times, what are the chances that we would observe a result at least this extreme if the null hypothesis were true?”.
Applying that to the A/B test example we’ve been using, a frequentist analysis would look something like this: we observe that the blue button was clicked by a certain proportion of visitors, and the orange button was clicked by a certain other proportion of visitors. Our analysis outputs a p-value based on how big the sample size is and how different the observations in each group are from each other. The p-value tells us the probability that we’d see observations at least this extreme if there were actually no difference between the two groups. We decide on an acceptable level of error (usually 5%), and say for example “the blue button is more likely to be clicked, with 95% confidence”, which roughly means “if there were actually no difference and we had an infinite number of parallel universes to run this experiment in, we’d wrongly declare a winner in only 5% of universes”. Note that this is a different statement from the Bayesian one above: “there’s a 95% chance that the blue button is more likely to be clicked”.
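And here’s what the frequentist analysis of those same made-up counts might look like, using a standard two-sided two-proportion z-test from statsmodels:

```python
from statsmodels.stats.proportion import proportions_ztest

counts = [180, 150]   # clicks on blue, orange (same made-up numbers as above)
nobs = [500, 500]     # visitors shown each button

z_stat, p_value = proportions_ztest(counts, nobs, alternative='two-sided')
print(f"z = {z_stat:.2f}, p = {p_value:.3f}")
# If p <= 0.05, we'd call the difference "significant at the 95% level".
```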
My motivations for investigating Bayes tests
I was initially drawn to Bayes testing because I find it to be more intuitive to interpret than hypothesis testing (especially for those without a background in statistics). Additionally, like most companies that are not Google or Facebook, we often struggle to get a large enough sample size for significant results; it seemed to me that using a Bayes test would make it easier to reach useful outcomes by a) being able to incorporate prior knowledge and b) outputting distributions rather than a single p-value. Some people also claim that, unlike hypothesis testing, Bayesian testing is immune to peeking, which would let you stop tests early and save time.
“More interpretable results” is a benefit that doesn’t require any investigation, but I’m not so sure about “faster to reach useful outcomes” and “ability to early-stop”. Besides that, there are a couple other things that are unclear to me about Bayes tests, even after some internet-scouring. For one, hypothesis testing gives an explicit promise regarding Type I error (the p-value threshold, a.k.a. the probability of seeing a difference when there is none) and Type II error (1 minus the power, a.k.a. the probability of not seeing a difference when there is one), but the Bayesian interpretation doesn’t – what kind of Type I/II error should I actually expect with Bayes testing? Another complication with Bayes tests is the choice of prior – what happens if I choose a prior that’s very strict or very diffuse? What if I choose one that’s just wrong? Finally, hypothesis testing is an extremely widely-accepted way of analyzing A/B test results, and my company didn’t previously use Bayes tests for this purpose. So, there’s some burden of proof in showing that a Bayesian approach won’t cause us to draw wildly inaccurate conclusions.
Objectives, and how we’ll get there
This post will attempt to answer the following questions:
- What kind of Type I and Type II error rate are we dealing with if we use a Bayesian approach to A/B testing?
- How does early stopping (aka “peeking”) affect the Type I error rate in Bayes testing?
I’ll be tackling these additional two in a future post:
- How large of a sample does it take to see a significant result in Bayes vs hypothesis testing?
- How does choice of prior affect the Type I/II error rate of Bayes tests?
To answer these questions, I’ll be simulating a couple situations to see how the test behaves in each one:
- Bernoulli data with no difference, a small difference, a large difference
- Bernoulli data with a 90/10 split
- Bernoulli data with a small difference and low baseline
- Poisson data with no difference, some difference
I’ll represent these situations as comparisons of simulated “populations” of 10,000 random trials. For the purposes of Bayesian analysis, I’ll make up a prior for each population that will represent my beliefs about the population proportions (in the case of Bernoulli variables) or means (in the Poisson case). I’ve plotted the priors of the 6 simulated A/B test comparisons below so that we can visualize them. When I actually run the Bayes test, I’ll use the prior for group A as the prior for the test (analogous to A being a control group and B being the treatment group).
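For reference, here’s roughly what the simulation setup looks like in Python. The population parameters and priors below are illustrative placeholders, not necessarily the exact values behind the plots:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
N = 10_000  # size of each simulated "population"

# Illustrative population parameters
pop_A_bernoulli = rng.binomial(1, 0.30, N)    # baseline click rate of 30%
pop_B_small_diff = rng.binomial(1, 0.32, N)   # small lift
pop_B_large_diff = rng.binomial(1, 0.40, N)   # large lift
pop_A_poisson = rng.poisson(2.0, N)           # e.g. visits per user
pop_B_poisson = rng.poisson(2.2, N)
# (the 90/10 split and low-baseline cases just change the group sizes and rates)

# Priors on the group-A parameter: Beta for Bernoulli proportions,
# Gamma for Poisson means (both conjugate, which keeps the math simple)
prior_bernoulli = stats.beta(30, 70)           # centered near 30%
prior_poisson = stats.gamma(a=20, scale=0.1)   # centered near a mean of 2
```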
How much type I/II error do we incur using Bayesian stopping rules?
Even more background: what are type I/II errors in A/B tests?
In A/B testing, the Type I/II error rate refers to the probability of thinking that A and B are different when they’re really the same (Type I) or thinking that A and B are the same when they’re really different (Type II).
In frequentist hypothesis testing, we set a threshold for significance based on the p-value, which directly caps the type I error rate – that is, if we call a difference “significant” in a hypothesis test, we’re saying that the type I error rate (probability of thinking A and B are different when they’re really the same) is below our threshold p-value. Additionally, it’s standard practice to choose a sample size before the test begins by estimating some parameters and setting a minimum probability of detecting a real difference, called the “power” (which is 1 minus the maximum type II error rate). To give a concrete example, a hypothesis test with p = 0.05 and power = 80% will have a 5% probability of type I error and a 20% probability of type II error.
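To make that concrete, here’s the kind of up-front sample-size calculation a frequentist test relies on, using statsmodels (the 30% → 35% lift, alpha, and power are just illustrative choices):

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# How many users per group to detect a lift from 30% to 35%
# with alpha = 0.05 and 80% power?
effect = proportion_effectsize(0.30, 0.35)   # Cohen's h for the two proportions
n_per_group = NormalIndPower().solve_power(effect_size=effect,
                                           alpha=0.05, power=0.8,
                                           alternative='two-sided')
print(round(n_per_group))  # roughly 1,400 users per group with these numbers
```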
On the other hand, Bayes testing (Bayesian inference in general, really) has the goal of making the best decision from your perspective using the data and beliefs that you have, so it doesn’t explicitly concern itself with type I and II error. In fact, your error rates will vary depending on the prior you choose, even if the data is exactly the same. Instead of using a p-value threshold (which caps the type I error rate) as a stopping rule, we have a couple options with Bayes tests: “expected loss” is the most commonly used stopping rule, and represents the loss we’d expect to incur if we chose the wrong option; another one that interests me is “probability that A > B”, or P(A>B). I’m not usually one to buck the accepted standard, but I have a hard time wrapping my mind around how “expected loss” is mathematically derived, and “there’s an X% chance that treatment is better than control” is the most intuitive possible explanation of an A/B test result anyway, in my opinion – so, I think this (seemingly less commonly-used?) P(A>B) stopping rule is what I’ll be using throughout this blog post.
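Both metrics are easy to estimate by sampling from the two posterior distributions. Here’s a sketch reusing the made-up posteriors from the button example earlier; the “expected loss” definition below is the one commonly used in Bayesian A/B testing tools (the amount of conversion rate you’d expect to give up if you picked the losing option):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Posterior samples for the two groups (the made-up blue/orange posteriors from above)
a_draws = stats.beta(195, 345).rvs(200_000, random_state=rng)
b_draws = stats.beta(165, 375).rvs(200_000, random_state=rng)

# Stopping-rule metric 1: probability that A beats B
p_a_gt_b = (a_draws > b_draws).mean()

# Stopping-rule metric 2: expected loss from choosing B, i.e. how much
# conversion rate we give up on average if we pick B and A was actually better
expected_loss_b = np.maximum(a_draws - b_draws, 0).mean()

print(f"P(A > B) ≈ {p_a_gt_b:.3f}")
print(f"Expected loss of choosing B ≈ {expected_loss_b:.4f}")
```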
Result 1: for tests of the same sample size, Bayes testing generally has a lower type I error rate than hypothesis testing, especially for data with a 90/10 split
To calculate type I error rate, we’ll run some Bayes and hypothesis tests where group A and group B are in fact identical, and see how many of the tests falsely detect a significant difference. Our decision rules for each test are: two-tailed p-value <= 0.05 (hypothesis); probability that A and B are different >= 95% (i.e., P(A>B) >= 97.5% or <= 2.5%) (Bayes). We’ll also simulate 4 different types of data: Bernoulli data split 50/50, Poisson data split 50/50, Bernoulli data split 90/10, and Bernoulli data with a low baseline rate. I’ll use the arbitrarily-chosen-but-reasonable-looking priors from the section above for my Bayes tests.
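Here’s a simplified sketch of what one of these simulations looks like for the 50/50 Bernoulli case. The 30% true rate and the Beta(30, 70) prior match the illustrative setup sketch above and are placeholders, not necessarily the exact values I used:

```python
import numpy as np
from scipy import stats
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(3)

def significant_rate(n_tests=500, n_samples=1000, p_a=0.3, p_b=0.3,
                     prior=(30, 70), n_draws=20_000):
    """Simulate many A/B tests and return the fraction each method calls
    'significant', as (hypothesis test rate, Bayes test rate)."""
    alpha0, beta0 = prior
    freq_hits = bayes_hits = 0
    for _ in range(n_tests):
        a = rng.binomial(1, p_a, n_samples // 2)
        b = rng.binomial(1, p_b, n_samples // 2)

        # Hypothesis test: two-sided z-test on proportions, p <= 0.05
        _, p_val = proportions_ztest([a.sum(), b.sum()], [len(a), len(b)])
        freq_hits += p_val <= 0.05

        # Bayes test: P(A > B) outside the [2.5%, 97.5%] band
        a_post = stats.beta(alpha0 + a.sum(), beta0 + len(a) - a.sum())
        b_post = stats.beta(alpha0 + b.sum(), beta0 + len(b) - b.sum())
        p_a_gt_b = (a_post.rvs(n_draws, random_state=rng)
                    > b_post.rvs(n_draws, random_state=rng)).mean()
        bayes_hits += (p_a_gt_b >= 0.975) or (p_a_gt_b <= 0.025)

    return freq_hits / n_tests, bayes_hits / n_tests

# With identical groups (an A/A test), the returned rates are type I error estimates
print(significant_rate(n_tests=100, p_a=0.30, p_b=0.30))
```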
I used 500 tests of 1000 total samples in each estimation of type I error rate, estimated the rate 3 times, and plotted the median with min and max as error bars. Not the most rigorous of error bars, but just enough to get a sense of how much this metric changes from run to run (without waiting 5 million years for this code to run!).
The resulting bar chart (below) shows that Bayes testing results in a slightly lower type I error rate in most cases. With a 90/10 split, there’s an especially noticeable difference, and with Poisson data, the two tests are comparable. My guess is that the big difference in the 90/10 split case is because, thanks to the prior distribution, Bayes tests are more robust to small sample sizes (like the small minority group you get with a 90/10 split).
Some caveats here: 1) because I knew what the actual population proportion/mean was here, I was able to choose an accurate prior for the Bayes test (we’ll investigate how choice of prior impacts error rate in the next post), 2) “p <= 0.05” and “95% probability of difference” aren’t exactly the same thing, so this comparison may be a little misleading – however, I chose these decision rules because they’re probably what I would use by default if I were running a real A/B test, so they’re an applicable way for me to compare the two methods.
Result 2: Bayes tests have a higher type II error rate for comparisons where there’s a large actual difference, but a lower one for data with a 90/10 split rather than 50/50
For type II error rate, we’ll run Bayes and hypothesis tests on two samples from different underlying distributions, and see how many of the tests fail to detect a significant difference. We’ll use the same decision rules as before (two-tailed p-value <= 0.05 for hypothesis and P(A>B) >= 97.5% or <= 2.5% for Bayes) and simulate 5 different types of data: Bernoulli data split 50/50 with a small difference, Bernoulli data split 50/50 with a large difference, Poisson data split 50/50, Bernoulli data split 90/10, and Bernoulli data with a low baseline rate.
I used 3 replications of 500 tests and plotted the median with min and max as error bars, as before. Interestingly, neither test has a lower type II error rate in all cases. Bayes outperforms hypothesis testing in the case of a 90/10 split, while hypothesis testing is slightly better (though not always outside of error) in the case of Poisson data or Bernoulli data with a large difference. The two tests are comparable for underlying data with a small difference, regardless of baseline rate. This outcome seems to make intuitive sense, since the prior in a Bayes test does a poor job of describing the treatment group in cases where the treatment group is actually a lot different from control. So, it takes a little more work for your observed data to convince you away from your prior beliefs, vs. a hypothesis test, which starts with no prior beliefs.
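As an aside, the same `significant_rate` sketch from the type I section can be reused here: feed it two genuinely different rates, and one minus the detection rate is a type II error estimate (the rates below are illustrative):

```python
# Reusing the significant_rate() sketch from the type I section,
# now with a real difference between the groups
freq_detect, bayes_detect = significant_rate(n_tests=100, p_a=0.30, p_b=0.35)
print("type II error ≈", 1 - freq_detect, "(hypothesis),",
      1 - bayes_detect, "(Bayes)")
```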
How does early stopping affect type I error rate?
To simulate a test with early stopping, we’ll use the same total sample size as in part 1 (1000 total draws per A/B test), but re-run the test after every 100 samples and consider the result to be significant if any of those checks passes the decision rule (a sketch of this peeking loop is below). We can then compare this type I error rate to the values we found in part 1. In an effort to make this not take forever to run, I’m only using 100 tests instead of 500 to estimate the type I error rate. The results indicate that Bayes testing with early stopping is preferable to hypothesis testing with early stopping.
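Here’s a sketch of the peeking procedure for the Bayesian side (the frequentist version is the same loop with the z-test swapped in; the rate and prior are placeholders as before):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

def peeking_false_positive_rate(n_tests=100, n_samples=1000, peek_every=100,
                                p_true=0.3, prior=(30, 70), n_draws=20_000):
    """Identical groups, but check the Bayes decision rule every `peek_every`
    total samples and call the test significant if *any* peek crosses it."""
    alpha0, beta0 = prior
    false_positives = 0
    for _ in range(n_tests):
        a = rng.binomial(1, p_true, n_samples // 2)
        b = rng.binomial(1, p_true, n_samples // 2)
        # Peek at 50, 100, ..., 500 samples per group (100, 200, ..., 1000 total)
        for n in range(peek_every // 2, n_samples // 2 + 1, peek_every // 2):
            a_post = stats.beta(alpha0 + a[:n].sum(), beta0 + n - a[:n].sum())
            b_post = stats.beta(alpha0 + b[:n].sum(), beta0 + n - b[:n].sum())
            p_a_gt_b = (a_post.rvs(n_draws, random_state=rng)
                        > b_post.rvs(n_draws, random_state=rng)).mean()
            if p_a_gt_b >= 0.975 or p_a_gt_b <= 0.025:
                false_positives += 1
                break  # stop the test early at the first "significant" peek
    return false_positives / n_tests

print(peeking_false_positive_rate())
```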
However, we saw earlier that in these cases, Bayes testing had a generally lower type I error rate than hypothesis. So how much does early stopping worsen the type I error rate? When we do this comparison, we can see that early stopping impacts Bayes testing just as negatively as it impacts hypothesis testing in most cases!
How can we make sense of this result in light of the “does Bayes testing allow for peeking” debate? My take is that both sides are right, kind of. On one hand, Bayes testing never promises any level of Type I error – all it promises is to give probabilities based on the information available. How representative of the greater population is the data that we’ve collected? – No idea, and no promises made. So from this perspective, the result of a Bayes test is valid no matter how many times you peek at the data.
On the other hand, regardless of our school of thought, we do actually care about Type I error – that is, we do actually want to minimize the probability of wrongfully concluding that two distributions are different. In these simulations, peeking increases this probability by a huge amount (up to 15-20x!). So it definitely seems misleading to say “Bayes tests are immune to peeking”. In other words: Every time you peek at a Bayes test you get a valid result based on your current information, but you increase the chances of making a decision based on misleading current information. One could argue that I’m using the wrong decision rule for my Bayes test (expected loss is much more common in the field), but you’d have the same issue no matter what decision rule you use – you may get a valid expected loss, but there’s no guarantee that you’re getting a true result in the larger context.
In practice, I’ve seen that once a Bayes test P(A>B) comes anywhere close to the significance threshold, the day-to-day fluctuations are within a few percent at most. Given all of that, my takeaway is to keep track of Bayes test results as they come in, but to keep following the data for a couple extra days after the first positive result comes in. Essentially, “peeking is fine, but be skeptical about the outcomes of an early peek.”