
Why your A/B-test needs confidence intervals

Criteo R&D Blog, Sep 1, 2020

by Aloïs Bissuel, Vincent Grosbois and Benjamin Heymann

The recent media debate on COVID-19 drugs is a unique occasion to discuss why decision making in an uncertain environment is a complicated but fundamental topic in our technology-based, data-fueled societies.

Aside from systematic biases (which are also an important topic), any scientific experiment is subject to many unfathomable, noisy, or random phenomena. For example, when one wants to know the effect of a drug on a group of patients, one may always wonder: "If I were to take a similar group and rerun the test, would I get the same result?"
Such an experiment would at best end up with similar results. Hence the right question is probably something like "Would my conclusions still hold?" Several dangers await the incautious decision-maker.

A first danger comes from the design and context of the experiment. Suppose you are given a billion dice, and that you roll each die 10 times. If one die gives you a 6 all ten times, does that mean it is biased?
This is possible, but on the other hand, the design of this experiment makes it extremely likely to observe such an outlier, even if all the dice are fair.
In this case, the eleventh throw of this die will likely not be a 6 again.
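A quick back-of-the-envelope computation shows why such an outlier is to be expected even with fair dice:

```python
# Expected number of fair dice, out of one billion, that show a 6 on all ten rolls
n_dice = 1_000_000_000
p_all_sixes = (1 / 6) ** 10        # probability for a single fair die
print(n_dice * p_all_sixes)        # ≈ 16.5 dice, so observing one is not surprising
```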

A second danger comes from human factors, in particular incentives and cognitive biases.
Cognitive biases, because when someone is convinced of something, they are more inclined to listen to, and report, a positive rather than a negative signal.
Incentives, because society, be it in the workplace, the media or scientific communities, is more inclined to praise statistically positive results than negative ones.
For instance, suppose an R&D team is working on improving a module by running A/B tests. Suppose also that positive outcomes are very unlikely, negative ones very likely, and that the results of the experiments are noisy. An incautious decision-maker is likely to roll out more negative experiments than positive ones, so the overall effort of the R&D team will end up deteriorating the module.

At Criteo, we use A/B testing to make decisions while coping with uncertainty. It is something we understand well because it is at the heart of our activities. Still, A/B testing raises many questions that are technically involved, and so require some math and statistical analysis. We propose to answer some of them in this blog post.

When can you conclude on this A/B-test?

We will introduce several statistical tools to determine if an A/B-test is significant, depending on the type of metric we are looking at. We will focus first on additive metrics, where simple statistical tools give direct results, and then we will introduce the bootstrap method which can be used in more general settings.

How to conclude on the significance of an A/B-test

We will present good practices based on the support of the distribution (binary / non-binary) and on the type of metric (additive / non-additive).
In the following sections, we will propose different techniques to assess whether an A/B-test change is significant, or whether we cannot confidently conclude that the A/B-test had any effect.

We measure the impact of the change made in the A/B-test by looking at metrics. In the case of Criteo, such a metric could for instance be "the number of sales made by a user over a given period".

We measure the same metric on two populations: the reference and the test population. Each population has its own distribution, from which we gather data points (and then compute the metric). We also assume that the A/B-test has split the population randomly and that the measurements are independent of each other.

Some special cases for additive metrics

Additive metrics can be computed at a small granularity (for instance at the display or user level), and then summed up to form the final metric. Examples include the total cost of an advertiser campaign (which is the sum of the cost of each display), or the number of buyers for a specific partner.

The metric contains only zeros or ones

At Criteo, we have quite a few binary metrics that we use for A/B-test decision making. For instance, we may want to compare the number of users who clicked on an ad or, better yet, the number of buyers in the two populations of the A/B-test. The base variable here is the binary variable X, which indicates whether a user has clicked on an ad or bought a product.

For this type of variable, the test to use is Pearson's chi-squared test.

Let us say we want to compare the number of buyers between the reference population and the test population. Our null hypothesis H0 is that the two populations have the same buying rate. The alternative hypothesis H1 is that the two populations do not have the same buying rate. We fix the significance level at 0.05, for instance (meaning that we reject the null hypothesis only if the test statistic falls in its 5% most extreme values under H0).

We have gathered the following data:

Experimental data (source: synthetic data)

Under the null hypothesis, populations A and B are sampled from the same distribution, so the empirical distribution of the two populations pooled together can be taken as the generative reference distribution. Here we get a conversion rate close to 1%.

Using this conversion rate for both populations, as the null hypothesis states, we can compute the expected distribution of buyers for populations A and B.

Expected distribution (source: synthetic data)

The chi-squared statistic is the sum, over all cells, of the squared difference between the expected value and the observed value, divided by the expected value. Here we get χ² = 0.743. We have chosen a significance level of 0.05, so for this one-degree-of-freedom chi-squared distribution, the statistic needs to stay below 3.84. Since it does, we cannot reject H0, and we deem this A/B-test neutral.

Note that this test naturally accommodates unbalanced populations (here, population B is half the size of population A).
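As a sketch of how this test can be run in practice, here is a minimal example using scipy; the buyer counts and population sizes below are made up for illustration and are not the figures from the tables above.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts, for illustration only (population B is half the size of A)
buyers = np.array([10_100, 5_000])             # buyers in populations A and B
non_buyers = np.array([1_000_000, 500_000]) - buyers

# 2x2 contingency table: one row per population, columns = buyers / non-buyers
table = np.stack([buyers, non_buyers], axis=1)

chi2, p_value, dof, expected = chi2_contingency(table, correction=False)
print(chi2, p_value, dof)   # reject H0 only if p_value < 0.05
```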

General case for an additive metric

When the additive metric is not composed only of zeros and ones, the central limit theorem can be used to derive the confidence interval of the metric directly. The central limit theorem tells us that the distribution of the mean of i.i.d. variables is approximately normal, with parameters that can be estimated from the data. Multiplying by the total number of summed elements gives us the confidence bound for the sum.

In the case of an A/B-test, things are a little more complicated, as we need to compare two sums. Here, the test to use is a variant of Student's t-test. As the variances of the two samples are not expected to be the same, we have to use the specific version of the test adapted to this situation, called Welch's t-test. As for any statistical test, one needs to compute the test statistic and compare it to the critical value given by the significance level chosen beforehand.

This method has a solid statistical basis and is computationally very efficient, as it only needs the mean and the variance of the metric over the two populations.
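A minimal sketch of Welch's t-test with scipy; the arrays below stand for per-user values of an additive metric and are synthetic, not Criteo data.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

# Synthetic per-user values of an additive metric (e.g. spend per user)
reference = rng.exponential(scale=10.0, size=100_000)
test = rng.exponential(scale=10.2, size=100_000)

# equal_var=False gives Welch's t-test, which does not assume equal variances
t_stat, p_value = ttest_ind(test, reference, equal_var=False)
print(t_stat, p_value)   # reject H0 (equal means) only if p_value < 0.05
```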

The metric has no special property

The bootstrap method makes it possible to conclude in this general case. We will present two versions of it. First, we introduce the exact one, which happens to be non-distributed. To accommodate larger data sets, a distributed version will be presented afterwards.

Bootstrap method (non-distributed)

Bootstrapping is a statistical method for estimating the distribution of a statistic. It was pioneered by Efron in 1979. Its main strength is its ease of application to any metric, from the simplest to the most complicated. There are metrics for which you cannot apply the central limit theorem, for instance the median (or any quantile) or, for a more business-related one, the cost per sale (a ratio of two sums of random variables). Its other strength lies in the minimal amount of assumptions needed to use it, which makes it a method of choice in most situations.

The exact distribution of the metric can never be fully known. One good way to get at it would be to reproduce the experiment which led to the creation of the data set many times and compute the metric each time. This would of course be extremely costly, not to mention largely impractical when an experiment cannot be run exactly the same way twice (such as when some sort of feedback exists between the model being tested and the results of the experiment).

Instead, the bootstrap method treats the data set as an empirical distribution and replays "new" experiments by drawing samples (with replacement) from this data set. This is infinitely more practical than redoing the experiment many times. It has, of course, the computational cost of recreating entire data sets by sampling with replacement (see below for a discussion of how many). In such a resampling, the number of times each example is drawn follows a binomial distribution with parameters (n, 1/n).

Bootstrap method: A practical walk-through

  1. First, decide on the number k of bootstraps you want to use. A reasonable number is at least a hundred, depending on the final use of the bootstraps.
  2. For every bootstrap to be computed, recreate a data set from the initial one by random sampling with replacement, this new data set having the same number of examples as the reference data set.
  3. Compute the metric of interest on this new data set.
  4. Use the k values of the metric to estimate its distribution and conclude on the statistical significance of the test.
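As a rough illustration of these four steps, here is a minimal (non-distributed) sketch in Python; the function and variable names are ours, chosen only for the example.

```python
import numpy as np

def bootstrap_metric(values, metric, k=1000, seed=0):
    """Return k bootstrapped values of `metric`, each computed on a resampled data set."""
    rng = np.random.default_rng(seed)
    n = len(values)
    results = np.empty(k)
    for i in range(k):
        # Recreate a data set of the same size by sampling with replacement
        resample = rng.choice(values, size=n, replace=True)
        results[i] = metric(resample)
    return results

# Example: bootstrap the median of a synthetic per-user metric
values = np.random.default_rng(1).lognormal(mean=0.0, sigma=1.0, size=50_000)
boot_medians = bootstrap_metric(values, np.median, k=1000)
```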

How to use bootstraps in the case of an A/B-test

When analyzing an A/B-test, there is not one but two bootstraps to be made: the metric has to be bootstrapped on both populations. The bootstrapped difference of the metric between the two populations can then be computed simply by subtracting the unsorted bootstraps of one population from those of the other. For additive metrics, this is equivalent to computing, on the whole population, a new metric which equals the original one on one population and minus the original one on the other.

The bootstrapping has to be done in a fashion compatible with the split used for the A/B-test. Let us illustrate this by an example. At Criteo, we do nearly all our A/B-tests on users. We also think that most of the variability we see in our metrics comes from the users, so we bootstrap our metrics on users.

Finally, if the populations have different sizes, additive metrics need to be normalized so that the smaller population is comparable to the larger one. This is not necessary for intensive metrics (such as conversion rates or averages). For instance, to compare the effect of lighting conditions on the number of eggs laid in two different chicken coops, one larger than the other, the number of eggs has to be rescaled as if the two chicken coops had the same arbitrary size (note that here, we do not rescale the number of chickens, as there might be no fixed relation between the size of the coop and the number of chickens). But when studying the average number of eggs laid per hen, no such rescaling is needed (as it is already done by the averaging).
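Putting these remarks together, a minimal sketch (with names of our own choosing) of the bootstrapped difference for an additive metric between two populations of unequal sizes could look like this:

```python
import numpy as np

def bootstrapped_difference(boot_test, boot_ref, n_test, n_ref):
    """Difference of unsorted bootstraps, test minus reference.

    boot_test, boot_ref: k bootstrapped values of an additive metric on each population.
    n_test, n_ref: population sizes, used to rescale the reference metric so that both
    populations are comparable (not needed for intensive metrics such as rates).
    """
    scale = n_test / n_ref
    return np.asarray(boot_test) - scale * np.asarray(boot_ref)
```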

How to conclude on the significance of the A/B-test?

The bootstrap method gives an approximation of the distribution of the metric. This can be used directly in a statistical test, either non-parametric or parametric. For instance, given the distribution of the difference of some metric between the two populations, a non-parametric test would be, under H0 "the two populations are the same, i.e. the difference of the metric should be zero" and a significance level of 0.05, to compute the [2.5%, 97.5%] quantiles by ordering the bootstrapped metric, and to fail to reject H0 if 0 lies inside this confidence interval. Other methods exist for computing a confidence interval from a bootstrapped distribution. A parametric test would, for example, approximate the distribution of the metric by a normal one (using the mean and the variance computed from the bootstraps) and fail to reject H0 if 0 is less than (approximately) two standard deviations away from the mean. This is valid only for metrics whose distribution is approximately normal (for instance a mean value, where the central limit theorem applies).
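For the non-parametric variant, the decision rule boils down to a couple of lines; here is a sketch, assuming `boot_diff` holds the k bootstrapped differences between the two populations:

```python
import numpy as np

def is_significant(boot_diff, alpha=0.05):
    """Two-sided percentile test: significant only if 0 lies outside the interval."""
    lower, upper = np.percentile(boot_diff, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return not (lower <= 0.0 <= upper)
```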

An illustrated example

To explain how to use the bootstrap method for A/B-testing, let us go through a very simple example. Two populations A and B each have one million users, and we are looking at the number of people who bought a product. For our synthetic example, population A has a conversion rate of exactly one percent, whereas population B has a conversion rate of 1.03%. We simulate the two data sets using Bernoulli trials, and end up with 10,109 buyers in population A and 10,318 buyers in population B. Note that the empirical conversion rate of population A deviates far more from its true rate than that of population B. The question asked is: "Is the difference in the number of buyers significant?" The null hypothesis is that the two populations are generated from the same distribution. We will use a [2.5%; 97.5%] confidence interval to conclude.
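The synthetic populations described above can be generated in a few lines; with a different random seed the exact buyer counts will of course differ from the 10,109 and 10,318 quoted here.

```python
import numpy as np

rng = np.random.default_rng(42)
n_users = 1_000_000

# Bernoulli trials: 1 if the user bought a product, 0 otherwise
pop_a = rng.binomial(1, 0.0100, size=n_users)
pop_b = rng.binomial(1, 0.0103, size=n_users)
print(pop_a.sum(), pop_b.sum())   # numbers of buyers in each population
```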

We use a thousand bootstraps to estimate the empirical distribution of the number of buyers in both populations. The histogram of the bootstraps is plotted below:

Empirical distribution of the number of buyers for populations A and B. 1000 bootstraps were used. (source: synthetic data)

As predicted by the central limit theorem, the distributions are roughly normal and centered around the empirical number of buyers (dashed vertical lines). The two distributions overlap.

To find the distribution of the difference in the number of buyers, the bootstraps of both populations are randomly shuffled and subtracted one by one. Another way to look at it is to concatenate the two data sets and create a metric worth +1 for a buyer in A and -1 for a buyer in B: the sum of this metric gives the difference in the number of buyers between the two populations. If the random sampling is compatible with the population split, this gives exactly the same result. Below is the result of this operation. As expected from the central limit theorem, the distribution is approximately normal. The color indicates whether a bootstrap is positive or negative.

Histogram of difference of number of buyers between the two populations. Color indicate if bootstrap is positive or negative. (source: synthetic data)

In this case, the number of positive bootstraps is 74 out of 1000, i.e. 7.4% of the bootstraps show a positive difference. Since this is well above the 2.5% threshold of the [2.5%; 97.5%] confidence interval we decided on earlier, zero lies inside the interval and we cannot reject the null hypothesis. We can only say that the A/B-test is inconclusive.

Bootstrap method (distributed)

When working with large-scale data, several things can happen to the data set:

  • The data set is distributed across machines and too big to fit into a single machine's memory
  • The data set is distributed across machines and, even though it could fit into a single machine, it is costly and impractical to collect everything on a single machine
  • The data set is distributed across machines and even the values of an observation are distributed across machines. For instance, this happens if the observation we are interested in is "how much did the user spend on all their purchases in the last 7 days" and our data set contains data at purchase level. To apply the previous methods, we would first need to regroup our purchase-level data set into a user-level data set, which can be costly.

In this section, we describe a bootstrap algorithm that works in this distributed setting. As it also works on any type of distribution, it is the most general algorithm presented here for getting confidence intervals.
Let us first define k, the number of resamplings we will do over the distribution of X (usually at least 100). To get S1, the sum of the first resampled draw, we would normally draw with replacement n elements from the series of X_i, and then sum them.

However, in a distributed setting, it might be impossible or too costly to draw with replacement from the distributed series X_i. We instead use an approximation: sum each element of the data set, weighted by a weight drawn from a Poisson(1) distribution. This weight is "seeded" by both the bootstrap id (from 1 to k) and an identifier representing the aggregation level our data should be grouped by. Going back to the previous example, this ensures that purchases made by the same user get the same weight in a given bootstrap resampling: we get the same end result as if we had grouped the data set by user in the first place.

We thus obtain S1, an approximate version of the sum over resampled data. Once we obtain the series S1, …, S_k, we use it as a proxy for the real distribution of S and can then take quantiles from this series. Note that this raises two issues:

  • An approximation error is made because the number of bootstraps is finite (the bootstrap method converges as the number of resamplings goes to infinity)
  • An approximation error that biases the measured quantiles is made because the resampling itself is approximate. This comes from the fact that, for a given resampling, the sum of the weights is not exactly equal to n (this is only true on average). For further reading, check out this great blog post and the associated scientific paper.
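To make the weighting scheme concrete, here is a minimal sketch of the Poisson bootstrap on a purchase-level data set, written with plain Python rather than an actual distributed framework; the hashing scheme and names are ours, chosen only to illustrate the per-(user, bootstrap) seeding.

```python
import hashlib
import numpy as np

def poisson_weight(user_id, bootstrap_id):
    """Poisson(1) weight, deterministically seeded by (user_id, bootstrap_id)."""
    digest = hashlib.md5(f"{user_id}-{bootstrap_id}".encode()).digest()
    seed = int.from_bytes(digest[:4], "little")
    return np.random.default_rng(seed).poisson(1.0)

def bootstrapped_sums(purchases, k=100):
    """purchases: iterable of (user_id, amount) rows. Returns k approximate resampled sums.

    Because the weight depends only on (user_id, bootstrap_id), all purchases of a given
    user get the same weight in a given resampling, as if the data had been grouped by
    user first. Each row is processed independently, so the same map/reduce logic can run
    on a distributed data set.
    """
    sums = np.zeros(k)
    for user_id, amount in purchases:
        for b in range(k):
            sums[b] += poisson_weight(user_id, b) * amount
    return sums

# Tiny purchase-level example
purchases = [("u1", 12.0), ("u2", 7.5), ("u1", 3.0), ("u3", 20.0)]
resampled_sums = bootstrapped_sums(purchases, k=100)
```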

Summary

Comparative pro/cons

Here is a short table highlighting the pros and cons of each method.

Source: authors

Conclusion

In this article, we have detailed how to analyze A/B-tests using various methods, depending on the type of metric which is analyzed and the amount of data. While all methods have their advantages and their drawbacks, we tend to use bootstraps most of the time at Criteo, because it is usable in most cases and lends itself well to distributed computing. As our data sets tend to be very large, we always use the distributed bootstraps method. We would not be able to afford the numerical complexity of the exact bootstrap method.

We also think that using the chi-squared and t-test methods from time to time is useful. Using them forces you to follow the strict statistical hypothesis testing framework and to ask yourself the right questions. As you may find yourself using the bootstrap method a lot, it helps you keep a critical eye on it and not blindly take its figures for granted!

About the authors

Aloïs Bissuel, Vincent Grosbois and Benjamin Heymann work in the same team at Criteo, on measuring all sorts of things.

Benjamin Heymann is a senior researcher, while Aloïs Bissuel and Vincent Grosbois are both senior machine learning engineers.
