<?xml version="1.0" encoding="UTF-8"?>
<rss  xmlns:atom="http://www.w3.org/2005/Atom" 
      xmlns:media="http://search.yahoo.com/mrss/" 
      xmlns:content="http://purl.org/rss/1.0/modules/content/" 
      xmlns:dc="http://purl.org/dc/elements/1.1/" 
      version="2.0">
<channel>
<title>Tabaré Capitán</title>
<link>https://www.tabarecapitan.com/blog/</link>
<atom:link href="https://www.tabarecapitan.com/blog/index.xml" rel="self" type="application/rss+xml"/>
<description>Projects and blog on causal inference.</description>
<generator>quarto-1.7.33</generator>
<lastBuildDate>Sat, 07 Feb 2026 23:00:00 GMT</lastBuildDate>
<item>
  <title>My first package in Python</title>
  <link>https://www.tabarecapitan.com/blog/0006-ritest-pypi/</link>
  <description><![CDATA[ 






<section id="intro" class="level2">
<h2 class="anchored" data-anchor-id="intro">Intro</h2>
<p>A few months ago I was analysing a product adoption A/B test and wanted to use randomisation inference (RI), which I’ve done many times in Stata using <code>ritest</code>. But I did not find a functional equivalent in Python. ‘Not a problem’, I thought. I could code my own RI implementation; it is simply a function to permute the assignment and refit OLS inside a <code>for</code> loop.</p>
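<p>That first idea fits in a few lines. A toy sketch (the function name and signature are my illustration, not the package’s API; the statistic is the OLS coefficient on the treatment indicator):</p>

```python
import numpy as np

def ri_pvalue_naive(y, treat, n_perm=1000, seed=0):
    """Permute the assignment and refit OLS inside a for loop.
    Illustrative sketch -- names and signature are mine, not the package's."""
    rng = np.random.default_rng(seed)

    def ols_slope(d):
        # OLS of y on [1, d]; the slope is the treatment coefficient
        X = np.column_stack([np.ones(len(d)), d])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        return beta[1]

    t_obs = ols_slope(treat)
    hits = sum(
        abs(ols_slope(rng.permutation(treat))) >= abs(t_obs)
        for _ in range(n_perm)
    )
    return t_obs, hits / n_perm
```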
<p>Long story short: it took me a few months to call this project ‘done’.</p>
<p>Indeed, the logic of RI is simple; it was not hard to get a working version. The point estimate and <img src="https://latex.codecogs.com/png.latex?p">-value matched Stata closely. I could have moved on. But I was a bit disappointed. I expected randomisation inference to be fast in Python, or at least faster than in Stata.<sup>1</sup> But that first version was much slower than Stata’s <code>ritest</code>. And I was familiar with R’s <code>ritest</code>, so I had an idea of what was possible.<sup>2</sup></p>
<p>In retrospect, it was naive of me to expect faster than Stata performance. I wrote that first version entirely with pandas and repeated calls to <code>statsmodels</code>; each iteration rebuilt the model and design matrix. Most of the runtime was overhead: rebuilding model objects and shuffling DataFrames, not the OLS itself. In terms of the data structures, pandas adds a lot of high-level machinery (indexing, alignment, copying) on top of NumPy arrays. And in terms of the estimation, <code>statsmodels</code> is great, but for this loop it was doing far more setup than needed.</p>
<p>I knew exactly what I had to do next: push the linear algebra closer to NumPy, avoid rebuilding objects unnecessarily, and use more specialised tools. As expected, my code got much faster. Since I had done so much work already, I naturally wanted to make sure my <code>ritest</code> would be ready for the next time I needed it. In fact, wouldn’t it be nice to at least match Stata’s <code>ritest</code> features?</p>
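<p>What that change looks like, as a sketch (my illustration of the idea, not the package’s actual implementation): with a demeaned outcome, the OLS slope on a single regressor reduces to a ratio of inner products, so all permuted coefficients become one matrix–vector product.</p>

```python
import numpy as np

def ri_pvalue_fast(y, treat, n_perm=1000, seed=0):
    """Same test as the naive loop, but closer to NumPy: the slope of y on a
    single demeaned regressor is (d_c @ y_c) / (d_c @ d_c), so the permuted
    coefficients can be computed in one vectorised pass."""
    rng = np.random.default_rng(seed)
    y_c = y - y.mean()

    d_c = treat - treat.mean()
    t_obs = (d_c @ y_c) / (d_c @ d_c)

    # stack the permuted assignments, then compute every coefficient at once
    D = np.stack([rng.permutation(treat) for _ in range(n_perm)]).astype(float)
    D_c = D - D.mean(axis=1, keepdims=True)
    T = (D_c @ y_c) / np.einsum("ij,ij->i", D_c, D_c)
    return t_obs, np.mean(np.abs(T) >= np.abs(t_obs))
```

For a balanced binary indicator, the slope equals the treated–control difference in means, so the two versions agree on the statistic while the vectorised one avoids rebuilding any model objects.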
<p>This post is about what I learned while following this very specific rabbit hole. Most of what I present in this post is obvious for those familiar with software development, but I am just an economist trying to use the right tools. Perhaps there are more like me out there, maybe in other fields, people who get distracted well beyond their original problem and convince themselves that it is time to write their first Python package.</p>
</section>
<section id="design" class="level2">
<h2 class="anchored" data-anchor-id="design">Design</h2>
<p>I think this is the most important step. You should know what you want.</p>
<p>My first priority was to achieve the same convenience I have had using Stata’s <code>ritest</code>, which required a simple public API. My second priority was flexibility, again a design concept taken from Stata’s <code>ritest</code>; this is what would end up being the ‘linear’ and ‘generic’ paths. Finally, my last priority was speed, which is at odds with the second priority. Conveniently, the separation between the ‘linear’ and ‘generic’ paths provided a natural solution: I would guarantee speed on the linear path.</p>
</section>
<section id="tools" class="level2">
<h2 class="anchored" data-anchor-id="tools">Tools</h2>
<p>This section is a description of the tools that I used. This is not a tutorial on how to install or use these tools; I’m not the right person to do so. The section simply sets the stage for the next sections, in which I precisely describe my development workflow. You can safely skip this section if you are familiar with <a href="https://en.wikipedia.org/wiki/DevOps">DevOps</a>.</p>
<section id="version-control" class="level3">
<h3 class="anchored" data-anchor-id="version-control">Version control</h3>
<p>I used Git for basic version control for a single-developer workflow. I mostly used <code>add</code>, <code>commit</code>, <code>checkout</code>, <code>branch</code>, and <code>push</code>.</p>
</section>
<section id="hooks" class="level3">
<h3 class="anchored" data-anchor-id="hooks">Hooks</h3>
<p><a href="https://git-scm.com/book/en/v2/Customizing-Git-Git-Hooks">Git hooks</a> are scripts that run automatically at certain points in the git workflow (for example, before a commit is created). In practice, this meant that a basic standard was enforced <em>before</em> anything became a commit. It is very easy to set everything up using <a href="https://github.com/pre-commit"><code>pre-commit</code></a>.</p>
<p>My hooks, declared in <a href="https://github.com/tabareCapitan/ritest/blob/master/.pre-commit-config.yaml"><code>.pre-commit-config.yaml</code></a>, include:</p>
<ul>
<li><strong><a href="https://docs.astral.sh/ruff/linter/">Ruff Linter</a></strong>: a fast <a href="https://en.wikipedia.org/wiki/Lint_(software)">linter</a> that catches common mistakes and, importantly, can fix many of them automatically.</li>
<li><strong><a href="https://docs.astral.sh/ruff/formatter/">Ruff Formatter</a></strong>: formats the code into a consistent style so formatting decisions stop being a recurring discussion.</li>
<li><strong>end-of-file-fixer</strong>: ensures files end with a newline.</li>
<li><strong>trailing-whitespace</strong>: removes stray whitespace at the end of lines.</li>
</ul>
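<p>A minimal sketch of such a configuration (the repositories are the real hook sources; the pinned <code>rev</code> values here are placeholders, so check the linked file for the actual configuration):</p>

```yaml
repos:
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.6.9          # placeholder; pin your own version
    hooks:
      - id: ruff         # linter, with autofix
        args: [--fix]
      - id: ruff-format  # formatter
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.6.0          # placeholder
    hooks:
      - id: end-of-file-fixer
      - id: trailing-whitespace
```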
<p>Hooks are not intended to affect program logic; they mostly enforce style and catch common mistakes. Their job is to keep the codebase clean and quiet.</p>
</section>
<section id="testing" class="level3">
<h3 class="anchored" data-anchor-id="testing">Testing</h3>
<p>Economics was not my first choice for my BA: I started in Computer Science and transferred after a few semesters. One of the things I learned there was that you should always write tests. So I did. In Python, you can use <a href="https://docs.pytest.org/en/stable/"><code>pytest</code></a>.</p>
<p>When I did not have a working version of the code, I was doing <a href="https://en.wikipedia.org/wiki/Unit_testing">unit testing</a>: testing very specific units of the codebase. Once I started putting the pieces together, I moved on to <a href="https://en.wikipedia.org/wiki/Integration_testing">integration testing</a>, which is when you check that units interact as expected. Even with all of these tests in place, I was very much relieved when I was able to do <a href="https://en.wikipedia.org/wiki/System_testing">end-to-end testing</a> to verify that I was indeed getting the correct results.</p>
<p>I should say that this is one of the cases in which I found <a href="https://en.wikipedia.org/wiki/Large_language_model">LLMs</a> to be most helpful. You still need to supervise and check the code, but the task of writing unit tests for a given script is very well suited to them.</p>
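<p>For a flavour of the pattern, a toy test file (my example, not the package’s actual suite) might look like this:</p>

```python
# test_pvalue.py -- run with `pytest`; a toy example of the pattern,
# not the package's actual test suite
import numpy as np

def perm_pvalue(y, treat, n_perm=500, seed=0):
    """Toy permutation p-value for a difference in means."""
    rng = np.random.default_rng(seed)

    def diff(d):
        return y[d == 1].mean() - y[d == 0].mean()

    t_obs = diff(treat)
    hits = sum(abs(diff(rng.permutation(treat))) >= abs(t_obs)
               for _ in range(n_perm))
    return hits / n_perm

def test_pvalue_is_a_probability():
    rng = np.random.default_rng(42)
    treat = np.repeat([0, 1], 20)
    p = perm_pvalue(rng.normal(size=40), treat)
    assert 0.0 <= p <= 1.0

def test_strong_effect_gives_small_pvalue():
    treat = np.repeat([0, 1], 20)
    y = 5.0 * treat + np.random.default_rng(0).normal(0, 0.1, 40)
    assert perm_pvalue(y, treat) < 0.01
```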
</section>
<section id="continuous-integration-ci" class="level3">
<h3 class="anchored" data-anchor-id="continuous-integration-ci">Continuous integration (CI)</h3>
<p>Since I was coding by myself, a remote repository did not matter much during development. Still, I kept pushing because I wanted <a href="https://github.com/features/actions">GitHub Actions</a>.</p>
<p>A test may pass on my computer but not somewhere else. I could create a new virtual environment, but I was still on my computer. And it was not convenient. I wanted to install the package in a fresh environment and run the test suite. Well, this is precisely what continuous integration (CI) does, and you can do it for free (if the repository is public) using GitHub Actions.</p>
<p>In essence, the CI does one thing: it installs the package in a fresh environment and runs the test suite. This happens automatically on pushes and pull requests, and it runs across the Python versions that the package claims to support.</p>
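<p>A workflow file of roughly this shape does the job (a sketch; the action versions and the Python matrix are placeholders, not my actual configuration):</p>

```yaml
# .github/workflows/ci.yml -- sketch; versions and extras are placeholders
name: CI
on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ["3.10", "3.11", "3.12"]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}
      - run: python -m pip install ".[test]"   # fresh install of the package
      - run: pytest                            # run the test suite
```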
</section>
<section id="build-and-release-tooling" class="level3">
<h3 class="anchored" data-anchor-id="build-and-release-tooling">Build and release tooling</h3>
<p>I used (what I think is) the standard approach. Once the code was ready to be released, I built the distribution artefacts locally: a source distribution (<code>sdist</code>) and a built distribution (<code>wheel</code>). This step forces you to confront packaging issues early, because it exercises exactly the same metadata and configuration that users rely on when installing the package. I then uploaded these artefacts using <code>twine</code>.</p>
</section>
</section>
<section id="workflow" class="level2">
<h2 class="anchored" data-anchor-id="workflow">Workflow</h2>
<p>Now that I’ve described the tools, I can tell you about my (solo) development workflow. It represents the workflow for a new feature or change; for a very minor change, or at earlier stages, you can just skip the new branch.</p>
<p>By the way, I wrote <code>ritest</code> during 2025 with support from OpenAI’s Codex. Things may have changed with the new models, but at the time, Codex needed close oversight and made lots of mistakes. That is not to say it was not a great tool; this package would be worse (or even a never-completed project) without LLMs. Like all the tools mentioned in this post, it is just another way to make the job easier.</p>
<div class="cell" data-layout-align="default">
<div class="cell-output-display">
<div>
<p></p><figure class="figure"><p></p>
<div>
<pre class="mermaid mermaid-js">flowchart TD
  A[Create branch] --&gt; B[Edit code]
  B --&gt; C[Run pre-commit]
  C --&gt; D{Hooks changed files?}
  D -- yes --&gt; E[git add -A]
  E --&gt; F[Commit]
  D -- no --&gt; F[Commit]
  F --&gt; G[Run tests: pytest]
  G --&gt; H{Tests pass?}
  H -- no --&gt; B
  H -- yes --&gt; I[Merge into main/master]
  I --&gt; J[Push]
  J --&gt; K{CI green?}
  K -- no --&gt; B
  K -- yes --&gt; L[Done]

</pre>
</div>
<p></p></figure><p></p>
</div>
</div>
</div>
<p>The workflow corresponds to:</p>
<div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb1-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">git</span> checkout <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-b</span> feat/x</span>
<span id="cb1-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># edit code</span></span>
<span id="cb1-3"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">pre-commit</span> run <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--all-files</span></span>
<span id="cb1-4"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">pytest</span></span>
<span id="cb1-5"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">git</span> add <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-A</span></span>
<span id="cb1-6"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">git</span> commit <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-m</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"message"</span></span>
<span id="cb1-7"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">git</span> checkout main</span>
<span id="cb1-8"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">git</span> merge feat/x</span>
<span id="cb1-9"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">git</span> push</span></code></pre></div>
</section>
<section id="packaging-and-release-on-pypi" class="level2">
<h2 class="anchored" data-anchor-id="packaging-and-release-on-pypi">Packaging and release on PyPI</h2>
<p>Once you are ready to ship, you need a <code>pyproject.toml</code> file. This is where you tell Python tools what your project is and how it should be built. It declares project metadata (name, version, description, URLs), minimum Python version, and runtime dependencies.</p>
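<p>A minimal sketch of the shape of that file (every value here is a placeholder, and the build backend is just one common choice):</p>

```toml
# pyproject.toml -- minimal sketch; all values are placeholders
[build-system]
requires = ["setuptools>=68"]
build-backend = "setuptools.build_meta"

[project]
name = "my-package"
version = "0.1.0"
description = "One-line description of the package."
requires-python = ">=3.10"
dependencies = ["numpy>=1.24"]

[project.urls]
Homepage = "https://example.com/my-package"
```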
<p>To release on PyPI, you can follow these steps:</p>
<ul>
<li>build wheel + sdist locally</li>
<li>upload to <a href="https://test.pypi.org/">TestPyPI</a> with <code>twine</code> (this is ‘rehearsal’)</li>
<li>install into a clean environment and actually use it</li>
<li>upload the same artefacts to <a href="https://pypi.org/">PyPI</a></li>
<li>bump version + tag release</li>
<li>Done!</li>
</ul>
</section>
<section id="documentation" class="level2">
<h2 class="anchored" data-anchor-id="documentation">Documentation</h2>
<p>The second lesson I took from my time as a CS student is that you must write proper documentation. So I also did that. And it took much longer than I thought.</p>
<p>Beyond proper in-code documentation, I think you need, at the very least, a <code>README</code> file for your GitHub repository. It describes what the package does, how to install it, and how to use it. Then, you need a <code>CHANGELOG</code> file to keep track of changes. For me, the real work was writing a comprehensive documentation site, built with Quarto and deployed via GitHub Pages. This is where I show basic and advanced use, examples, technical notes, and the API reference.</p>
</section>
<section id="one-last-word" class="level2">
<h2 class="anchored" data-anchor-id="one-last-word">One last word</h2>
<p>I hope that I’ve clearly conveyed that this is not a “how it is done” post. I just did it for the first time. I am sharing this post for selfish reasons. I would like to hear from people who have done this many times. Am I missing something that would make the process easier, faster, or more robust? Please <a href="https://tabarecapitan.com/">reach out</a> if you have something to say.</p>
<p>And if you are thinking about releasing your first package, you may find this post useful; just keep in mind that this is in no way an authoritative guide.</p>


</section>


<a onclick="window.scrollTo(0, 0); return false;" id="quarto-back-to-top"><i class="bi bi-arrow-up"></i> Back to top</a><div id="quarto-appendix" class="default"><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p>I can’t guarantee this is correct. I don’t know much about Stata’s MATA language for linear algebra. This is just what you may call an ‘informed hunch’.↩︎</p></li>
<li id="fn2"><p>In R’s <code>ritest</code> documentation, Grant McDermott presents <a href="https://grantmcdermott.com/ritest/articles/ritest.html#example-ii-real-life-data">a case</a> in which the runtime goes down from 183 seconds in Stata, to 6.58 seconds in R.↩︎</p></li>
</ol>
</section></div> ]]></description>
  <category>software</category>
  <guid>https://www.tabarecapitan.com/blog/0006-ritest-pypi/</guid>
  <pubDate>Sat, 07 Feb 2026 23:00:00 GMT</pubDate>
</item>
<item>
  <title>Confidence intervals in randomisation inference</title>
  <link>https://www.tabarecapitan.com/blog/0005-ritest-ci/</link>
  <description><![CDATA[ 






<section id="intro" class="level2">
<h2 class="anchored" data-anchor-id="intro">Intro</h2>
<p>When you do randomisation inference (RI), your output typically shows the observed statistic, as well as a <img src="https://latex.codecogs.com/png.latex?p">-value, a standard error, and a confidence interval. This looks very similar to what you get from a regression where, say, for a given coefficient, your output shows the point estimate, as well as a standard error, a <img src="https://latex.codecogs.com/png.latex?t">-statistic, a <img src="https://latex.codecogs.com/png.latex?p">-value, and a confidence interval.</p>
<p>But these outputs, other than the observed statistic or point estimate, are conceptually very different. In this post I try to make sense of the output of randomisation inference.</p>
</section>
<section id="p-values" class="level2">
<h2 class="anchored" data-anchor-id="p-values"><img src="https://latex.codecogs.com/png.latex?p">-values</h2>
<p><a href="../../blog/0001-inference/index.html">Randomisation inference (RI)</a> starts from a statistic <img src="https://latex.codecogs.com/png.latex?T(%5Ccdot)">: a difference in means, a regression coefficient, a median difference, or anything else you care about. The design (the assignment mechanism) induces a randomisation distribution for that statistic <em>under a null hypothesis</em>.</p>
<p>Under a sharp null, the randomisation <img src="https://latex.codecogs.com/png.latex?p">-value is a tail probability under the assignment mechanism: <img src="https://latex.codecogs.com/png.latex?%0Ap%20%5C;=%5C;%20%5CPr%5C!%5Cleft(%7CT%7C%20%5Cge%20%7CT_%7B%5Ctext%7Bobs%7D%7D%7C%20%5C;%5Cmiddle%7C%5C;%20H_0,%5C;%20%5Ctext%7Bdesign%7D%5Cright).%0A"></p>
<p>If you can enumerate every valid assignment, you can compute <img src="https://latex.codecogs.com/png.latex?p"> exactly, in the design-based sense. In most real problems there are too many valid assignments and it is not worth it to go through all of them, so you sample <img src="https://latex.codecogs.com/png.latex?R"> valid reassignments.</p>
<p>Let <img src="https://latex.codecogs.com/png.latex?c"> be the number of assignments in which the statistic is more extreme than the observed statistic: <img src="https://latex.codecogs.com/png.latex?%0Ac%20%5C;=%5C;%20%5Csum_%7Br=1%7D%5ER%20%5Cmathbf%7B1%7D%5C%7B%7CT_r%7C%20%5Cge%20%7CT_%7B%5Ctext%7Bobs%7D%7D%7C%5C%7D.%0A"> A common Monte Carlo estimator is<sup>1</sup> <img src="https://latex.codecogs.com/png.latex?%0A%5Chat%20p%20%5C;=%5C;%20%5Cfrac%7Bc%7D%7BR%7D.%0A"></p>
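<p>The estimator is mechanical. A sketch with made-up data, following the notation above (the statistic here is a difference in means):</p>

```python
import numpy as np

# Monte Carlo randomisation p-value, following the formulas above.
# The data are simulated purely for illustration (no true effect).
rng = np.random.default_rng(0)
y = rng.normal(size=200)                          # observed outcomes
treat = rng.permutation(np.repeat([0, 1], 100))   # the observed assignment

def T(d):  # the test statistic: a difference in means
    return y[d == 1].mean() - y[d == 0].mean()

R = 2000
T_obs = T(treat)
T_r = np.array([T(rng.permutation(treat)) for _ in range(R)])
c = int(np.sum(np.abs(T_r) >= np.abs(T_obs)))     # count of extreme draws
p_hat = c / R                                     # the Monte Carlo estimate
```

Rerunning with a different seed gives a slightly different <code>p_hat</code>; that spread is exactly the Monte Carlo error.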
<p>At this point it is no longer correct to treat the reported <img src="https://latex.codecogs.com/png.latex?p">-value as a fixed number. It has Monte Carlo error because <img src="https://latex.codecogs.com/png.latex?R"> is finite. In contrast, in an OLS regression, the <img src="https://latex.codecogs.com/png.latex?p">-value is a deterministic function of the data, given the modelling assumptions. So there is no Monte Carlo uncertainty.</p>
<p>It is worth pointing out that <img src="https://latex.codecogs.com/png.latex?p">-values, both in RI and the regression context, only make sense in the context of a hypothesis test. <em>A <img src="https://latex.codecogs.com/png.latex?p">-value is not a generic measure of ‘signal strength’; it is defined relative to a null hypothesis and a reference distribution for the statistic.</em> In RI, both are explicit: the null is sharp, and the reference distribution comes from the assignment mechanism. In the regression context, the null is typically weak and the reference distribution is introduced analytically (often via asymptotic arguments).</p>
</section>
<section id="confidence-intervals" class="level2">
<h2 class="anchored" data-anchor-id="confidence-intervals">Confidence intervals</h2>
<p>We now have our randomisation inference (RI) <img src="https://latex.codecogs.com/png.latex?p">-value, which is an estimate with Monte Carlo error. It is then natural to represent this error with a confidence interval (CI) <em>for the <img src="https://latex.codecogs.com/png.latex?p">-value itself</em>.</p>
<p>Conditional on the observed data and the null, each reassignment either lands in the tail or it does not. That makes <img src="https://latex.codecogs.com/png.latex?c"> behave like a binomial count: <img src="https://latex.codecogs.com/png.latex?%0Ac%20%5Csim%20%5Coperatorname%7BBinomial%7D(R,%20p).%0A"> Then, you can build a <img src="https://latex.codecogs.com/png.latex?(1-%5Calpha)"> interval for <img src="https://latex.codecogs.com/png.latex?p"> from this binomial model (there are several standard choices).</p>
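<p>One such standard choice is the Wilson score interval, which needs only the tail count and the number of draws (a sketch; Clopper–Pearson and other intervals are equally common):</p>

```python
from statistics import NormalDist
import math

def p_value_ci(c, R, alpha=0.05):
    """Wilson score interval for the randomisation p-value,
    treating c ~ Binomial(R, p). One of several standard choices."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    p_hat = c / R
    denom = 1 + z**2 / R
    centre = (p_hat + z**2 / (2 * R)) / denom
    half = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / R + z**2 / (4 * R**2))
    return centre - half, centre + half
```

For example, <code>c = 50</code> tail hits in <code>R = 1000</code> draws gives an interval around 0.05 that straddles the usual threshold, so a ‘significant at 5%’ call would be genuinely uncertain at that <code>R</code>.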
<p>The interpretation is narrow but clean. The RI confidence interval of the <img src="https://latex.codecogs.com/png.latex?p">-value:</p>
<ul>
<li>quantifies uncertainty from <em>simulation</em>, not from (theoretically) drawing a new dataset,</li>
<li>shrinks as <img src="https://latex.codecogs.com/png.latex?R"> grows, and</li>
<li>tells you when a ‘statistical significance’ call is robust versus when you are basically flipping a coin near a threshold.</li>
</ul>
<p>Now let’s get back to the regression setting, say, for an A/B test. How do we build a confidence interval?</p>
<p>To be concrete, define the finite-sample average treatment effect <img src="https://latex.codecogs.com/png.latex?%0A%5Ctau_%7B%5Ctext%7BATE%7D%7D%20%5C;=%5C;%20%5Cfrac%7B1%7D%7BN%7D%5Csum_%7Bi=1%7D%5EN%20%5Cbig(Y_i(1)-Y_i(0)%5Cbig).%0A"></p>
<p>If you estimate the effect as a treated–control difference in means, or as the coefficient on a treatment indicator in an OLS regression with an intercept, you are targeting <img src="https://latex.codecogs.com/png.latex?%5Ctau_%7B%5Ctext%7BATE%7D%7D"> on the outcome scale. A standard regression confidence interval takes the form <img src="https://latex.codecogs.com/png.latex?%0A%5Chat%5Ctau%20%5Cpm%20z_%7B1-%5Calpha/2%7D%5C,%5Cwidehat%7BSE%7D(%5Chat%5Ctau),%0A"> with details depending on how you estimate the variance.</p>
<p>The key thing is what that CI is <em>trying</em> to cover. In the usual regression presentation, the motivation is based on repeated sampling (often asymptotic): if we reran ‘the relevant randomness’ many times, the interval would cover the target parameter with frequency <img src="https://latex.codecogs.com/png.latex?1-%5Calpha">. The ‘not significant if the CI includes 0’ common saying is shorthand for not rejecting a weak null like <img src="https://latex.codecogs.com/png.latex?%0AH_0%5E%7B%5Ctext%7Bweak%7D%7D:%5C;%20%5Ctau_%7B%5Ctext%7BATE%7D%7D%20=%200,%0A"> under that sampling-based uncertainty and approximation. That is an uncertainty statement about an average effect.</p>
<p>Let’s take a moment for the distinction to sink in. The confidence interval in a typical regression table reflects uncertainty around the coefficient, while the (default) confidence interval in randomisation inference reflects uncertainty around the <img src="https://latex.codecogs.com/png.latex?p">-value. They are not at all comparable.</p>
</section>
<section id="confidence-set" class="level2">
<h2 class="anchored" data-anchor-id="confidence-set">Confidence set</h2>
<p>It looks like we are missing a piece in randomisation inference. <em>Is there no confidence interval for the coefficient?</em> Well… sort of. At least in spirit. But we need to do much more work, both to build it and to interpret it.</p>
<p>A randomisation test needs a null that lets you impute missing potential outcomes. The canonical ‘no effect’ sharp null is <img src="https://latex.codecogs.com/png.latex?%0AH_0:%5C;%20Y_i(1)=Y_i(0)%5C;%5C;%5Cforall%20i.%0A"> That is stronger than ‘the average effect is 0’ (the ‘weak’ null). It says nobody is affected.</p>
<p>To get an interval for an effect size, RI typically inverts a <em>family</em> of sharp nulls indexed by a candidate constant additive effect: <img src="https://latex.codecogs.com/png.latex?%0AH_0(%5Ctau_0):%5Cquad%20Y_i(1)%20=%20Y_i(0)%20+%20%5Ctau_0%20%5C;%5C;%5Ctext%7Bfor%20all%20%7D%20i.%0A"></p>
<p>For each <img src="https://latex.codecogs.com/png.latex?%5Ctau_0"> you compute a randomisation <img src="https://latex.codecogs.com/png.latex?p">-value <img src="https://latex.codecogs.com/png.latex?p(%5Ctau_0)">. Then you invert the tests: <img src="https://latex.codecogs.com/png.latex?%0A%5Cmathcal%7BC%7D_%7B1-%5Calpha%7D%20%5C;=%5C;%20%5C%7B%5Ctau_0:%5C;%20p(%5Ctau_0)%20%3E%20%5Calpha%5C%7D.%0A"></p>
<p>In other words: a ‘confidence interval’ for the coefficient under randomisation inference is the set of constant additive effects that you do not reject under design-based uncertainty.</p>
<p>That is what it means mechanically, and it is also the safest way to interpret it.</p>
<p>Everything else—especially ‘is it an ATE interval?’—depends on whether the constant-effect assumption is a defensible approximation for your application.</p>
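<p>A sketch of the mechanics (illustrative code, not <code>ritest</code>’s internals): under each candidate null we impute the control potential outcomes, rerandomise, and keep the values we fail to reject.</p>

```python
import numpy as np

# Test inversion under constant additive effects -- an illustrative sketch.
# The data are simulated with a true effect of 1.
rng = np.random.default_rng(0)
n, R, alpha = 200, 999, 0.05
treat = rng.permutation(np.repeat([0, 1], n // 2))
y = 1.0 * treat + rng.normal(size=n)

def diff(values, d):
    return values[d == 1].mean() - values[d == 0].mean()

def p_value(tau0):
    """Randomisation p-value for H0(tau0): Y_i(1) = Y_i(0) + tau0."""
    y0 = y - tau0 * treat                 # impute Y(0) for everyone under H0
    t_obs = diff(y0, treat)               # equals tau_hat - tau0
    t_r = np.array([diff(y0, rng.permutation(treat)) for _ in range(R)])
    return np.mean(np.abs(t_r) >= np.abs(t_obs))

grid = np.linspace(0.0, 2.0, 41)
band = [(tau0, p_value(tau0)) for tau0 in grid]     # tau0 -> p(tau0)
conf_set = [tau0 for tau0, p in band if p > alpha]  # the confidence set
# its smallest and largest elements are the 'confidence bounds'
```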
<p>Note that the ‘confidence interval’ in randomisation inference is literally defined as a set, which explains why, in randomisation inference, we call the boundaries of that set ‘confidence bounds’ rather than a ‘confidence interval’. In addition, we call the whole set a ‘confidence band’. This keeps the whole inversion visible: plot the curve <img src="https://latex.codecogs.com/png.latex?%5Ctau_0%20%5Cmapsto%20p(%5Ctau_0)"> and draw the horizontal line at <img src="https://latex.codecogs.com/png.latex?%5Calpha">. The confidence set is where the curve sits above the line. I like bands because they answer questions the bounds cannot, such as:</p>
<ul>
<li>do the endpoints come from a sharp crossing or a curve that barely grazes <img src="https://latex.codecogs.com/png.latex?%5Calpha">?</li>
<li>is the acceptable region a single chunk, or does it fragment?</li>
<li>if <img src="https://latex.codecogs.com/png.latex?p(%5Ctau_0)"> is computed by Monte Carlo, is the crossing stable once you acknowledge simulation noise?</li>
</ul>
<section id="interpretation" class="level3">
<h3 class="anchored" data-anchor-id="interpretation">Interpretation</h3>
<p>I did say that there is much work to do to interpret RI confidence bounds. The object itself is unambiguous: <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BC%7D_%7B1-%5Calpha%7D"> is the set of <img src="https://latex.codecogs.com/png.latex?%5Ctau_0"> values for which the constant-additive-effect null <img src="https://latex.codecogs.com/png.latex?H_0(%5Ctau_0)"> is not rejected at level <img src="https://latex.codecogs.com/png.latex?%5Calpha">.</p>
<p>Now the interpretation splits.</p>
<p>If effects are (approximately) constant and additive, so that <img src="https://latex.codecogs.com/png.latex?%0AY_i(1)=Y_i(0)+%5Ctau%20%5Cquad%20%5Cforall%20i,%0A"> then <img src="https://latex.codecogs.com/png.latex?%5Ctau"> is approximately equal to <img src="https://latex.codecogs.com/png.latex?%5Ctau_%7B%5Ctext%7BATE%7D%7D">, and <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BC%7D_%7B1-%5Calpha%7D"> is naturally read as a confidence set for the ATE under design-based uncertainty. In that world, the confidence bound carries essentially the meaning people expect.</p>
<p>If effects are heterogeneous, <img src="https://latex.codecogs.com/png.latex?H_0(%5Ctau_0)"> is a strong claim: it says <img src="https://latex.codecogs.com/png.latex?%5Ctau_i=%5Ctau_0"> for everyone. Then:</p>
<ul>
<li><img src="https://latex.codecogs.com/png.latex?0%20%5Cin%20%5Cmathcal%7BC%7D_%7B1-%5Calpha%7D"> means you did not reject “everyone’s effect is exactly 0” (given your statistic and design).</li>
<li><img src="https://latex.codecogs.com/png.latex?0%20%5Cnotin%20%5Cmathcal%7BC%7D_%7B1-%5Calpha%7D"> means you rejected that claim.</li>
</ul>
<p>What it does <em>not</em> automatically mean is ‘the ATE might be 0’ or ‘the ATE is nonzero’, because those are weak-null statements about an average, and the inverted tests are about a constant effect for every unit.</p>
<p>So under heterogeneity, in practice, you can read RI confidence bounds as a compatibility check for a simple constant-effect model, not as a replacement for an ATE interval.</p>
</section>
<section id="computation" class="level3">
<h3 class="anchored" data-anchor-id="computation">Computation</h3>
<p>I also said that we had to do much work to build the confidence bounds. And we do. Much more work.</p>
<p>A single randomisation test uses <img src="https://latex.codecogs.com/png.latex?R"> reassignments. Test inversion uses many randomisation tests—one for each candidate <img src="https://latex.codecogs.com/png.latex?%5Ctau_0"> you evaluate—so the computation is ‘RI, repeated’. A band is even more demanding because you are deliberately evaluating <img src="https://latex.codecogs.com/png.latex?p(%5Ctau_0)"> across a grid.</p>
</section>
<section id="discussion" class="level3">
<h3 class="anchored" data-anchor-id="discussion">Discussion</h3>
<p>So randomisation inference does (sort of) have confidence intervals for the coefficient. But they are much harder to build and to interpret. It almost feels pointless. In a regression setting, confidence intervals feel more immediately useful: they are designed to speak directly about an average effect on the outcome scale under a weak-null framing.</p>
<p>Still, I think that RI confidence bounds and bands can (sometimes) be useful. They are just useful in a different way: as a transparent, design-grounded way to ask ‘what constant-effect stories are compatible with what we saw?’</p>
<p>Or perhaps we are simply very used to thinking in terms of weak nulls. <em>A statement like <img src="https://latex.codecogs.com/png.latex?%5Ctau_%7B%5Ctext%7BATE%7D%7D=0"> is convenient and often relevant for decisions, but it can also be uninformative.</em> If effects differ in sign, the average can be zero even though treatment has substantial consequences for many units. In that sense, a weak null can hide structure rather than reveal it.</p>
<p>Sharp nulls force a different discipline. They ask whether any effect at all is compatible with the design-based evidence, or whether a simple, homogeneous effect could plausibly summarise what happened. That is a stronger question. But it is also a clarifying one. Seen this way, RI confidence bounds are not really ‘competing’ with regression confidence intervals. They are probing a different dimension: not ‘how large is the average effect?’, but ‘how simple a story about effects can we still defend?’.</p>


</section>
</section>


<a onclick="window.scrollTo(0, 0); return false;" id="quarto-back-to-top"><i class="bi bi-arrow-up"></i> Back to top</a><div id="quarto-appendix" class="default"><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p>Some implementations use small finite-sample adjustments; the point here is the same either way.↩︎</p></li>
</ol>
</section></div> ]]></description>
  <category>statistics</category>
  <guid>https://www.tabarecapitan.com/blog/0005-ritest-ci/</guid>
  <pubDate>Sun, 01 Feb 2026 23:00:00 GMT</pubDate>
</item>
<item>
  <title>Randomisation inference in Stata, R, and Python</title>
  <link>https://www.tabarecapitan.com/blog/0004-ritest-implementations/</link>
  <description><![CDATA[ 






<section id="intro" class="level2">
<h2 class="anchored" data-anchor-id="intro">Intro</h2>
<p>In this brief post I show how to do randomisation inference (RI) using three different implementations.</p>
<p>I start with the OG, which is <a href="https://hesss.org/">Simon Heß</a>’s <a href="https://github.com/simonheb/ritest">implementation in Stata</a>. This is the one I’ve used for many years. Then I cover <a href="https://grantmcdermott.com/">Grant McDermott</a>’s <a href="https://grantmcdermott.com/ritest/">implementation in R</a>, which is described as a port of the Stata implementation. Finally, I present my own very recent <a href="../../projects/ritest/">implementation in Python</a>, which is not necessarily a port, but explicitly aims to be functionally equivalent to the preceding implementations. All three implementations share the same name: <code>ritest</code>.</p>
<p>This is not a comprehensive list of implementations; I chose the ones that I have used. For example, in Stata, Alwyn Young has shared <a href="https://personal.lse.ac.uk/YoungA/">code</a> for randomisation inference and confidence intervals. In R, Alexander Coppock authored <code>ri2</code>, documented <a href="https://cran.r-project.org/web/packages/ri2/vignettes/ri2_vignette.html">here</a>. And you can find more general permutation frameworks in Python.</p>
</section>
<section id="hypothetical-example" class="level2">
<h2 class="anchored" data-anchor-id="hypothetical-example">Hypothetical example</h2>
<p>Think of a product A/B test at TikTok. A new onboarding flow is rolled out to a random subset of users, and you want an <a href="../../blog/0001-inference/index.html">assignment-based uncertainty</a> statement for the treatment effect.</p>
<ul>
<li>unit: user</li>
<li>treatment indicator: <code>treat</code> (1 = new onboarding, 0 = old)</li>
<li>primary outcome: <code>activated_7d</code> (1 = activated within 7 days)</li>
<li>pre-treatment covariates (optional, for precision): <code>pre_usage</code> (numeric), <code>device_ios</code> (0/1), <code>region_eu</code> (0/1)</li>
</ul>
<p>If the experiment used blocked randomisation, you also have a strata variable:</p>
<ul>
<li>strata / blocks: <code>strata_id</code> (e.g., country-by-device buckets used in the randomisation)</li>
</ul>
<p>If the experiment randomised at a higher level (say, by creator cohort or by market), you may also have:</p>
<ul>
<li>cluster: <code>cluster_id</code></li>
</ul>
<p>The statistic I will use is the treatment coefficient from</p>
<p><img src="https://latex.codecogs.com/png.latex?activated%5C_7d%20=%20%5Calpha%20+%20%5Ctau%20%5C,%20treat%20+%20%5Cbeta_1%20pre%5C_usage%20+%20%5Cbeta_2%20device%5C_ios%20+%20%5Cbeta_3%20region%5C_eu%20+%20%5Cvarepsilon."></p>
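<p>For concreteness, here is a small simulated dataset with this layout. The data-generating process is invented purely for illustration; only the column names match the setup above, and I omit <code>cluster_id</code> because assignment here is at the user level:</p>
<div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python">import numpy as np
import pandas as pd

rng = np.random.default_rng(123)
n = 5_000

# pre-treatment covariates
pre_usage = rng.gamma(shape=2.0, scale=1.5, size=n)
device_ios = rng.integers(0, 2, size=n)
region_eu = rng.integers(0, 2, size=n)

# blocked randomisation: treat exactly half of each stratum
strata_id = rng.integers(0, 20, size=n)
treat = np.zeros(n, dtype=int)
for s in np.unique(strata_id):
    idx = np.flatnonzero(strata_id == s)
    treat[rng.permutation(idx)[: len(idx) // 2]] = 1

# binary outcome with an invented +5pp treatment effect
p_activate = np.clip(0.30 + 0.05 * treat + 0.02 * pre_usage, 0.0, 1.0)
activated_7d = rng.binomial(1, p_activate)

df = pd.DataFrame({
    "treat": treat,
    "activated_7d": activated_7d,
    "pre_usage": pre_usage,
    "device_ios": device_ios,
    "region_eu": region_eu,
    "strata_id": strata_id,
})</code></pre></div>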
</section>
<section id="stata" class="level2">
<h2 class="anchored" data-anchor-id="stata">Stata</h2>
<p>You can install the command <code>ritest</code> from the SSC archive:</p>
<div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode stata code-with-copy"><code class="sourceCode stata"><span id="cb1-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">ssc</span> install ritest</span></code></pre></div>
<p>Below, <code>ritest</code> permutes the assignment variable <code>treat</code>, re-runs the estimation command each time, and collects the statistic of interest. The typical pattern is:</p>
<ul>
<li>write the model as you usually would</li>
<li>tell <code>ritest</code> which coefficient/statistic to track</li>
<li>optionally enforce strata / clusters to mirror the design</li>
</ul>
<div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode stata code-with-copy"><code class="sourceCode stata"><span id="cb2-1">ritest treat _b[treat], reps(5000) <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">seed</span>(123) <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">///</span></span>
<span id="cb2-2">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">strata</span>(strata_id) <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">cluster</span>(cluster_id) <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">nodots</span> : <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">///</span></span>
<span id="cb2-3">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">regress</span> activated_7d treat pre_usage device_ios region_eu, <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">vce</span>(<span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">cluster</span> cluster_id)</span></code></pre></div>
<p>Notes:</p>
<ul>
<li>The <code>cluster()</code> option in <code>ritest</code> refers to the randomisation unit if assignment is clustered; the <code>vce(cluster ...)</code> inside <code>regress</code> is a modelling choice for conventional standard errors and has no bearing on the randomisation inference itself.</li>
<li>If you did not randomise within strata, drop <code>strata(strata_id)</code>. If you did not cluster assignment, drop <code>cluster(cluster_id)</code>.</li>
</ul>
</section>
<section id="r" class="level2">
<h2 class="anchored" data-anchor-id="r">R</h2>
<p>You can install the package <code>ritest</code> from GitHub:</p>
<div class="sourceCode" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb3-1">remotes<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">install_github</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"grantmcdermott/ritest"</span>)</span></code></pre></div>
<p>In R, the pattern is usually:</p>
<ol type="1">
<li>fit a model object (often <code>lm()</code> or <code>fixest::feols()</code>)</li>
<li>pass the fitted object to <code>ritest()</code>, specifying the resampling variable and (optionally) strata/cluster structure</li>
</ol>
<div class="sourceCode" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb4-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(ritest)</span>
<span id="cb4-2"></span>
<span id="cb4-3">fit <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">lm</span>(</span>
<span id="cb4-4">  activated_7d <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> treat <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> pre_usage <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> device_ios <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> region_eu,</span>
<span id="cb4-5">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">data =</span> df</span>
<span id="cb4-6">)</span>
<span id="cb4-7"></span>
<span id="cb4-8">ri <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ritest</span>(</span>
<span id="cb4-9">  fit,</span>
<span id="cb4-10">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">resampvar =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"treat"</span>,</span>
<span id="cb4-11">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">reps =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5000</span>,</span>
<span id="cb4-12">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">strata =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"strata_id"</span>,</span>
<span id="cb4-13">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">cluster =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"cluster_id"</span>,</span>
<span id="cb4-14">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">seed =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">123</span></span>
<span id="cb4-15">)</span>
<span id="cb4-16"></span>
<span id="cb4-17">ri</span></code></pre></div>
<p>This mirrors the Stata workflow: the statistic is defined through a fitted model object, and RI is performed by permuting the assignment in a way that respects the experimental design.</p>
<section id="python" class="level2">
<h2 class="anchored" data-anchor-id="python">Python</h2>
<p>You can install the package from PyPI<sup>1</sup></p>
<div class="sourceCode" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb5-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">pip</span> install ritest-python</span></code></pre></div>
<p>The Python interface follows the same conceptual pattern:</p>
<ul>
<li>specify what to permute (<code>permute_var="treat"</code>)</li>
<li>specify the statistic through a formula and the coefficient name (<code>stat="treat"</code>)</li>
<li>optionally pass <code>strata</code> and <code>cluster</code> to respect the design</li>
</ul>
<div class="sourceCode" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> ritest <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> ritest</span>
<span id="cb6-2"></span>
<span id="cb6-3">res <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> ritest(</span>
<span id="cb6-4">    df<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>df,</span>
<span id="cb6-5">    permute_var<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"treat"</span>,</span>
<span id="cb6-6">    formula<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"activated_7d ~ treat + pre_usage + device_ios + region_eu"</span>,</span>
<span id="cb6-7">    stat<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"treat"</span>,</span>
<span id="cb6-8">    strata<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"strata_id"</span>,</span>
<span id="cb6-9">    cluster<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"cluster_id"</span>,</span>
<span id="cb6-10">    reps<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5000</span>,</span>
<span id="cb6-11">    seed<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">123</span>,</span>
<span id="cb6-12">)</span>
<span id="cb6-13"></span>
<span id="cb6-14"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(res.summary())</span></code></pre></div>
</section>
</section>
<section id="notes" class="level2">
<h2 class="anchored" data-anchor-id="notes">Notes</h2>
<p>I think that the sequential development of these implementations has made it very convenient for a user to move across the three languages. They all support very similar features and usage patterns, including ways to define generic statistics.</p>
<p>Despite these functional similarities, there are significant implementation differences. Some are imposed by the software environment; others reflect deliberate design and implementation choices. Because randomisation inference is, at its core, repeated re-estimation of the statistic, these differences can have a large <a href="../../projects/ritest/benchmarks/intro.html">impact on performance</a>. On a shared <a href="../../projects/ritest/benchmarks/colombia.html">real dataset example</a>, my documentation reports wall-clock times of roughly 220 seconds (Stata), 16.45 seconds (R), and 7.28 seconds (Python) for 5,000 permutations with a fixed-effects style specification and clustered, stratified design constraints. That said, performance is contingent on the specific computations required for a given statistic.</p>
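<p>To see where the time goes, note that across permutations only the treatment column of the design matrix changes; everything else can be built once, outside the loop. Here is a deliberately minimal NumPy sketch of that idea (not the internals of any of the three packages):</p>
<div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python">import numpy as np

rng = np.random.default_rng(123)
n, reps = 1_000, 500

# toy data mirroring the example regression
treat = rng.integers(0, 2, size=n)
covars = rng.normal(size=(n, 3))                 # covariate stand-ins
X_fixed = np.column_stack([np.ones(n), covars])  # built once, outside the loop
y = 0.2 * treat + X_fixed @ np.array([0.5, 0.3, -0.1, 0.2]) + rng.normal(size=n)

def tau_hat(t):
    # only the treatment column of the design matrix changes per draw
    X = np.column_stack([t, X_fixed])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[0]

obs = tau_hat(treat)
perm = np.array([tau_hat(rng.permutation(treat)) for _ in range(reps)])
p_value = float(np.mean(np.abs(perm) >= np.abs(obs)))</code></pre></div>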
<p>Which one should you use? Most likely, whatever you were planning to use. For most applications, it would not make sense to choose your software based on the performance of the available <code>ritest</code>. Stata is certainly not well known for its speed, but it remains my favourite language for data analysis.</p>


</section>


<a onclick="window.scrollTo(0, 0); return false;" id="quarto-back-to-top"><i class="bi bi-arrow-up"></i> Back to top</a><div id="quarto-appendix" class="default"><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p>You may have noticed that my package on PyPI is listed as <code>ritest-python</code> instead of <code>ritest</code>. It is not my fault: PyPI runs an automatic check that rejects names deemed too similar to existing packages, and there is really not much you can do about it. I think the culprit is <code>rotest</code>. In any case, this only affects installation; you can still use <code>ritest</code> in your scripts.↩︎</p></li>
</ol>
</section></div> ]]></description>
  <category>statistics</category>
  <category>software</category>
  <guid>https://www.tabarecapitan.com/blog/0004-ritest-implementations/</guid>
  <pubDate>Mon, 26 Jan 2026 23:00:00 GMT</pubDate>
</item>
<item>
  <title>The economics of the WNBA</title>
  <link>https://www.tabarecapitan.com/blog/0003-wnba/</link>
  <description><![CDATA[ 






<section id="introduction" class="level2">
<h2 class="anchored" data-anchor-id="introduction">Introduction</h2>
<p>It is an exciting time for the WNBA, albeit seemingly puzzling. On the one hand, the league boasts record viewership, has signed a new media deal worth over $2 billion (roughly $200 million per year), and has negotiated more than $1 billion in expected expansion fees. On the other hand, the league continues to present itself as operating closer to a charitable venture, running at a loss for nearly three decades for the benefit of the players.</p>
<p>These conflicting views are now colliding as the league negotiates the new collective bargaining agreement with the players. It is not going well: the third deadline expired on January 9th, there is <a href="https://sports.yahoo.com/articles/does-wnba-moratorium-agreement-mean-142838476.html?guccounter=1">now a moratorium</a>, and players <a href="https://www.espn.com/wnba/story/_/id/47348578/wnbpa-says-members-voted-strike-necessary">have voted to authorise calling a strike “when necessary”</a>.</p>
<p><a href="https://www.espn.com/wnba/story/_/id/47348578/wnbpa-says-members-voted-strike-necessary">The latest proposal by the WNBA</a> reportedly includes an almost four-fold increase in the minimum and average yearly salaries (to around $250,000 and $500,000, respectively), which the players did not accept.</p>
<p>At first glance, rejecting such an offer may seem hard to reconcile with the league’s long-running narrative of financial losses. But that reaction implicitly assumes that player pay in professional sports should be anchored to reported operating profits. In this post, I argue that this is a misleading frame. To understand the current negotiations, it is necessary to look at how professional leagues actually set wages, how operating losses relate to investor returns, and how value is created and distributed over time. Once those pieces are in place, the players’ position appears far less puzzling.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://www.tabarecapitan.com/blog/0003-wnba/Madison_Square_Garden_Liberty.jpg" class="img-fluid figure-img"></p>
<figcaption>Photo by <a href="https://www.flickr.com/photos/mesohungry/3629347913/">Jason Lam</a>, CC BY-SA 2.0, <a href="https://commons.wikimedia.org/w/index.php?curid=15821278">wikimedia</a></figcaption>
</figure>
</div>
</section>
<section id="theory" class="level2">
<h2 class="anchored" data-anchor-id="theory">Theory</h2>
<p>Professional leagues like the WNBA are best understood as a legally constrained joint venture that functions as a monopoly platform in the market for top-tier basketball while also operating as a dominant buyer of elite basketball labour. Because a league season is a joint product—teams are rivals on the court but complements in production—many core choices are centralised: rules of play, scheduling, media-rights packaging, revenue sharing, and restrictions on labour mobility. As a result, wages do not emerge from decentralised market clearing. They are set inside an administered internal labour market shaped by league governance (draft assignment, rookie scale, restricted free agency, salary cap and maximum contracts), which both manages competitive balance (a demand-side feature of the product) and compresses the set of feasible contracts relative to an open auction.</p>
<p>Player pay is therefore the equilibrium outcome of collective bargaining in a bilateral-monopoly environment: the league/teams act as a coordinated buyer through shared rules, and the union coordinates labour supply. The bargaining set is pinned down by basketball-related revenues (especially national media and sponsorship income) and by the league’s commitment to accounting definitions and cap architecture; the division of surplus is governed by threat points and outside options. Owners’ leverage comes from capital depth and the ability to withstand short-run losses, while players’ leverage comes from the star elasticity of demand, reputational and broadcast losses from a stoppage, and whatever outside earnings or alternative playing opportunities raise reservation pay. Negotiations are thus about the rent split and its incidence across player types—how the cap, maximums, minimums, and exceptions allocate revenue growth between superstars, mid-tier players, and marginal roster players—rather than ‘price discovery’ in any competitive sense.</p>
</section>
<section id="practice" class="level2">
<h2 class="anchored" data-anchor-id="practice">Practice</h2>
<p>This is much easier than the theory. What do we know in practice? I’ll cover three areas: the players’ experience, the league’s reported losses, and the investors’ bullish attitude.</p>
<section id="players-experience" class="level3">
<h3 class="anchored" data-anchor-id="players-experience">Players’ experience</h3>
<p>After a lifetime of hard work, very few players make it to the WNBA. Most of them get to the league after <a href="https://www.espn.com/wnba/story/_/id/43625067/who-eligible-enter-wnba-draft-rules-know">a (sort of) mandatory period of four years in college</a>—where they played for no salary. Congratulations, you’ve made it. Here is what you can expect.</p>
<p>Even if you are arguably the greatest prospect ever, Caitlin Clark, you will be paid under <a href="https://www.sportingnews.com/us/wnba/indiana-fever/news/caitlin-clark-salary-breakdown-reveals-how-underpaid-she-wnba/b2faa9f4c5319c4f3df07f21">$80,000.</a> If you are an average drafted player, you are likely getting under <a href="https://www.sportingnews.com/us/wnba/news/wnba-highest-paid-average-salary-rookie-deals-2024/def661966f0f9625d5427326">$70,000</a>. You can also count on a housing stipend, reportedly <a href="https://highposthoops.com/wnba-cba-negotiations-are-revealing-hidden-truths-about-league">ranging from ~$1,100 in Las Vegas to ~$2,500 in New York</a>, unless you choose to live in housing provided by your team, typically a one-bedroom apartment. So you have housing covered, one way or another. Except that it is covered only during the season and, if you make it, the postseason. And only if you are not suddenly cut. In practice, it is a bit of a nomad lifestyle.</p>
<p>Alright, you got used to your housing situation. It is what it is. Training camp went great. It is time for the regular season. If you are starting in 2025, you are in luck: you will get charter flights and single hotel rooms for away games. For those who started before last year, commercial flights were on the menu. And for those who started before 2020, shared rooms for non-veterans were the norm. There is progress, for sure. But it is far from the lifestyle you were hoping to experience in the very top league in the world.</p>
<p>‘Well, at least you only work half of the year!’ your friends may say. If only. You cannot afford to simply take off. It takes a lot of time and resources to stay in shape and ready for the next season. LeBron James, an obvious outlier, reportedly spends about <a href="https://fortune.com/well/article/lebron-james-biohacking-regimen-routine/">$1.5 million</a> per year to stay sharp, and he has a human body just like yours. Never mind… you are likely not going on vacation; it is time to travel abroad for the second season of the year.</p>
</section>
<section id="the-league-loses-money-every-year" class="level3">
<h3 class="anchored" data-anchor-id="the-league-loses-money-every-year">The league loses money every year</h3>
<p>As far as I know, there is no publicly available, standardised, audited, line-by-line WNBA income statement for outsiders to observe. We only have informal reports, two in particular. In 2018, the NBA commissioner, Adam Silver, mentioned that the WNBA has historically lost money, around <a href="https://apnews.com/wnba-crossroads-league-looks-to-cut-losses-hire-president-75e117e82df7470c94784438048171d1">$10 million per year</a>. Other sources report losses of around <a href="https://www.si.com/onsi/womens-fastbreak/news/adam-silver-addresses-report-nba-owners-are-frustrated-with-wnba-financial-losses-01jbdbg0b84y">$40-50 million in 2024</a>.</p>
<p>What exactly does it mean for the league to “lose money”? It means that what the league accounts for as operating expenses is higher than what the league accounts for as operating revenue. So, what are we talking about? The main operating expenses are travel, team and league operations (such as coaches, training staff, medical staff, facilities), game operations (arena staff and other costs), marketing and league administration, and player compensation. And the main revenue sources are media rights, sponsorships, ticketing, and merch (or licensing).</p>
<p>Because of this operating loss, many people conclude that players cannot expect to be paid well. After all, the league reports losses; there must simply be no money to pay them more.<sup>1</sup> Implicit in this view is the idea that owners are subsidising the league so that players can participate at all—and that players should therefore be grateful for the platform itself. But this framing conflates operating losses with investor returns.</p>
<p>I think this is what a lot of people misunderstand. The league may have operating losses, but that does not mean that the owners are losing.<sup>2</sup> Many high-growth firms (e.g., Amazon in its early years) generated negative operating income while delivering large capital gains to owners. The owner’s return on an investment in the WNBA is the sum of the operating profit, the capital gain, and the cash distributions, minus the capital injections:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Ctext%7BROI%7D%20%5C;%5Capprox%5C;%0A%5Cunderbrace%7B%5CPi_%7B%5Ctext%7Boperating%7D%7D%7D_%7B%5Ctext%7Bcash%20flow%20(can%20be%20%3C0)%7D%7D%20%5C;+%5C;%0A%5Cunderbrace%7B%5CDelta%20V_%7B%5Ctext%7Bfranchise%7D%7D%7D_%7B%5Ctext%7Bcapital%20gain%7D%7D%20%5C;+%5C;%0A%5Cunderbrace%7BD%7D_%7B%5Ctext%7Bdistributions%20/%20payouts%7D%7D%20%5C;-%5C;%0A%5Cunderbrace%7BK%7D_%7B%5Ctext%7Bcapital%20calls%20/%20injections%7D%7D%0A"></p>
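<p>With purely invented numbers (no relation to any actual team), the decomposition shows how a persistent operating loss can coexist with a handsome return:</p>
<div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"># purely invented numbers, for illustration only
operating_profit = -100_000_000   # e.g. ten years of $10M annual operating losses
capital_gain     =  390_000_000   # appreciation of the franchise over the holding period
distributions    =            0   # cash paid out to the owner
capital_calls    =   50_000_000   # cash the owner had to inject along the way

roi_dollars = operating_profit + capital_gain + distributions - capital_calls
# the owner comes out far ahead despite losing money 'every year'</code></pre></div>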
<p>This is why accounting gets strategic. If the league wants to raise more capital, they can emphasise the explosive growth in franchise value. If they want the players to take less money, they can emphasise the operating loss.</p>
</section>
<section id="the-investors-bullish-attitude" class="level3">
<h3 class="anchored" data-anchor-id="the-investors-bullish-attitude">The investors’ bullish attitude</h3>
<p>Investors do not appear to be worried about the league’s historical operating loss. There is no great public data, so I’m going by reports here. All valuation numbers should be treated as estimates, not facts.</p>
<p>The league may be valued at almost <a href="https://basketball.realgm.com/wiretap/280937/13-WNBA-Franchises-Collectively-Worth-$35-Billion">$3.5 billion in 2025</a>, with an average team value of <a href="https://www.forbes.com/sites/brettknight/2025/06/06/the-wnbas-most-valuable-teams-2025/">~$270 million</a>. This average <a href="https://edition.cnn.com/2025/06/24/sport/wnba-franchise-increase-value-sportico-spt">value in 2024 was $96 million</a>, for a ~2.8x multiple in one year. We also know that the New York Liberty reportedly sold for ~$10-14 million in 2019 and was valued by Forbes in 2025 at $400 million, with reports of a minority stake sale <a href="https://www.espn.com/wnba/story/_/id/45271533/reports-wnba-liberty-sell-stake-record-450m-valuation">at a $450 million valuation</a>. That’s ~30-45x in six years. No wonder investors are willing to pay <a href="https://www.espn.com/wnba/story/_/id/45618874/wnba-expansion-cleveland-detroit-philadelphia-cba-draft">$250 million</a> in expansion fees to get a team.</p>
<p>The question is: who has benefited from this growth? Consider Michael Jordan—without a doubt the most profitable player in the history of the NBA. In 13 seasons with the Chicago Bulls, <a href="https://www.forbes.com/sites/justinbirnbaum/2023/10/02/michael-jordan-joins-forbes-400-worth-3-billion/">he made less than $100 million in NBA salaries</a>. That’s 13 years of literal sweat, including six championships. In 2010, Jordan bought a majority stake to become the owner of the Charlotte Hornets for $275 million. In 2023, <a href="https://www.espn.com/nba/story/_/id/37863644/sources-michael-jordan-finalizing-charlotte-hornets-sale">he sold his majority stake for about $3 billion.</a> Roughly speaking, in 13 years as “the” player, Jordan made about 3.3% of what he made in 13 years as an owner.</p>
<p>Is it fair? That is subjective. I certainly believe that Jordan created more value as a player than as an owner (remember Michael Kidd-Gilchrist? yeah, me neither). In Jordan’s case, it feels appropriate that he was able to capture a share of the wealth he created. The point is that the operating losses are almost irrelevant when owners stand to capture the vast majority of the capital gains; the value of the league as an asset is increasing, and the owners own it.</p>
<p>The capital behind the owners operates in a logic very different from a ‘mom and pop’ shop. For example, the <a href="https://en.wikipedia.org/wiki/National_Basketball_Association">NBA’s roots</a> come from the Basketball Association of America (1946), which was founded by big hockey-arena owners to fill unused arena dates with basketball, and later merged into the NBA (1949). This strategy remains active: in many markets the NBA team is the anchor tenant that keeps an expensive arena (and its surrounding entertainment business) consistently monetised, while concerts and other events fill the remaining dates. More broadly, billionaire ownership is balance-sheet driven: owners can borrow against appreciating assets rather than “use cash” (or sell and realise taxable gains), structure holdings to optimise tax exposure, and treat ownership itself as a form of prestige and deal-flow that pays off beyond the team’s operating profit.</p>
</section>
</section>
<section id="current-negotiation" class="level2">
<h2 class="anchored" data-anchor-id="current-negotiation">Current negotiation</h2>
<p>The WNBA’s last proposal, <a href="https://www.espn.com/wnba/story/_/id/47466821/sources-wnba-projecting-big-losses-latest-proposal-union-disagrees">as reported</a>, looks generous at first glance because it roughly offers a four-fold increase in salaries across the board, with an average salary around half a million dollars. The players rejected the proposal. Instead, they demand a ‘fair’ share of the ‘cake’, as well as a clearly defined cake.</p>
<p>For context, in the NBA, there is a clearly defined cake: <a href="https://www.investopedia.com/articles/investing/070715/nbas-business-model.asp">basketball-related income</a>. This cake has been shared in a roughly 50/50 split since 1983, <a href="https://www.forbes.com/sites/kurtbadenhausen/2014/01/22/as-stern-says-goodbye-knicks-lakers-set-records-as-nbas-most-valuable-teams/">when the league revenues were around $118 million</a>. Using <a href="https://www.nytimes.com/1983/06/09/sports/kings-are-sold.html">the sale of the Kansas City Kings to a group from Sacramento in 1983 for $10.5 million</a> as an anchor price, we can multiply that price by the 23 teams in the league at the time to obtain a total of about $241 million, which corresponds to about $781 million in 2025 dollars. Even assuming that the anchor price was below average, this back-of-the-envelope calculation suggests that the NBA was valued under a billion dollars when the players negotiated a roughly 50/50 split. Again, the WNBA today is estimated to be worth about $3.5 billion.</p>
<p>The players’ demands focus on revenue sharing. They want a similar definition of the cake, as well as a share of it. In particular, they <a href="https://www.cbc.ca/sports/basketball/wnba-players-union-no-deal-collective-agreement-negotiations-9.7042228">reportedly seek a 30/70 split</a>, still below the NBA’s 50/50 split.</p>
<p>The players have also <a href="https://www.espn.com/wnba/story/_/id/47011671/wnba-cba-negotiations-wnbpa-updates-latest-news">reported discontent</a> with the ‘disrespectful tone’ of the negotiations, in which the league has repeatedly argued that the players do not understand the situation, returning again and again to the operating losses to which it wants the negotiation anchored.</p>
<p>Unfortunately, the league’s narrative of confused players seems to resonate with the public. As I have argued in this piece, the theory behind price determination in this case is rather complex, but there is credible empirical evidence of the investors’ bullish attitude towards the WNBA. It is then puzzling (or is it?) that people argue that the WNBA players don’t understand a situation that directly and greatly affects them. Never mind that <a href="https://herhoopstats.substack.com/p/what-did-wnba-players-study-in-college">most players have completed 4-year degrees</a>.</p>
</section>
<section id="a-proposal" class="level2">
<h2 class="anchored" data-anchor-id="a-proposal">A proposal</h2>
<p>There has always been value in the WNBA. This was recognised in the mid-1990s, when the NBA launched the WNBA as a “strategic extension of its basketball platform”. My interpretation is that the motivation was not “to help women’s basketball”, but to control the product, timing, and branding of women’s pro basketball in the US. The value remains today, greater than ever, as shown by the recent growth in market value.</p>
<p>So… I think that the players are asking for too little. I say they should ask for almost the same deal that NBA players get. I don’t see a strong economic reason for large differences in draft eligibility, revenue-sharing split, salary cap strictness, salary structure, and player movement or retention tools.<sup>3</sup></p>
<p>Along with the “bold” demand of equality, the players may benefit from steering away from the league’s focus on the operating loss. Instead, bring the focus to the owners and their capital gains. More specifically, bring the NBA into the spotlight. After all, <a href="https://www.espn.ph/wnba/story/_/id/47602121/wnba-cba-negotiations-collective-bargaining-agreement-wnbpa-update-latest">the NBA owns about 42% of the WNBA</a> (outside team owners own ~42% and outside investors the remaining ~16%). When talking about ‘owners’ in the abstract, there is no reputation at stake. The NBA’s reputation should be at stake.</p>
<p>Furthermore, there are credible alternatives to the WNBA, including the <a href="https://auprosports.com/basketball/">Athletes Unlimited Pro Basketball</a> league, <a href="https://www.unrivaled.basketball/">Unrivaled</a>, and <a href="https://eu.usatoday.com/story/sports/basketball/2025/11/21/project-b-guide-startup-womens-basketball-league/87395675007/">Project B</a>, as well as established leagues outside the US. These alternatives provide real outside options and therefore meaningful leverage in negotiations. The same strategic logic that led the NBA to launch the WNBA in the 1990s remains relevant today: controlling the development of a growing adjacent market can be more valuable than maximising short-term operating margins. From the NBA’s perspective, the risk is not that a challenger is imminent, but that suppressing player compensation in a rapidly growing women’s game increases the probability that future growth occurs outside its institutional umbrella. In that sense, conceding meaningful equality to WNBA players is less a concession than a hedge against long-term competitive and reputational risk.</p>
<p>Finally, there is a channel of value creation that is largely absent from the current framing: demand expansion. For most of the league’s history, playing in the WNBA has been closer to a “for the love of the game” proposition than a financial one, at least for the median player. In my view, that matters because former players—especially those who played the sport growing up—are likely to be among the most durable and loyal consumers of basketball over a lifetime, even if this effect is gradual and cohort-driven rather than immediate.</p>
<p>In a league whose domestic fan base has historically skewed male, the WNBA represents a structurally underexploited opportunity to broaden basketball consumption by bringing in new fans rather than reallocating existing ones. This is not simply about gender composition—differences there appear modest—but also about age and entry points into fandom, where WNBA audiences tend to <a href="https://yougov.com/en-us/articles/49344-interest-in-the-wnba-is-the-highest-ever-but-who-are-the-fans">skew younger,</a> which is particularly valuable for long-run demand. This perspective is especially relevant for the NBA today, as much of its recent growth appears to be driven by <a href="https://www.sportbusiness.com/news/international-fanbase-driving-nbas-social-media-success/?utm_source=chatgpt.com">international expansion</a> rather than domestic market deepening.</p>
<p>From that standpoint, the WNBA is not only an asset whose value lies in media rights and franchise appreciation, but also a long-horizon investment in <a href="https://yougov.com/en-us/articles/49344-interest-in-the-wnba-is-the-highest-ever-but-who-are-the-fans?utm_source=chatgpt.com">expanding the overall basketball audience in the US and beyond</a>. Underinvesting in player compensation risks slowing that process. Viewed this way, improving the economic terms for WNBA players is not primarily about fairness or bargaining power; it is a strategic investment in <a href="https://basis.com/blog/3-things-advertisers-need-to-know-about-womens-sports-fans?utm_source=chatgpt.com">future demand</a>.</p>


</section>


<a onclick="window.scrollTo(0, 0); return false;" id="quarto-back-to-top"><i class="bi bi-arrow-up"></i> Back to top</a><div id="quarto-appendix" class="default"><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p>It should be noted that the commissioner’s salary is private information. Industry norms suggest, however, that the salary is unlikely to be under $1 million. Possibly around $1.5 million, which is the same as the salary cap for an entire team. Does the commissioner add as much value as one whole team?↩︎</p></li>
<li id="fn2"><p>In fact, <a href="https://www.espn.com/nba/story/_/id/20747413/a-confidential-report-shows-nearly-half-nba-lost-money-last-season-now-what">many NBA teams reportedly have had operating losses</a>.↩︎</p></li>
<li id="fn3"><p>Some differences are defensible; for example, the NBA has a longer season.↩︎</p></li>
</ol>
</section></div> ]]></description>
  <category>economics</category>
  <category>basketball</category>
  <guid>https://www.tabarecapitan.com/blog/0003-wnba/</guid>
  <pubDate>Thu, 15 Jan 2026 23:00:00 GMT</pubDate>
</item>
<item>
  <title>Introducing ritest: randomisation inference in Python</title>
  <link>https://www.tabarecapitan.com/blog/0002-ritest-intro/</link>
  <description><![CDATA[ 






<p>A few months ago I was analysing data from a randomised experiment aimed at increasing product adoption. It was the kind of project that shows up everywhere: a new feature ships, some users see it, some do not, and the goal is to figure out whether the feature had the intended effect.</p>
<p>The obvious next step is a <img src="https://latex.codecogs.com/png.latex?t">-test. That is what most analyses of this kind start with, and often where they stop.</p>
<p>But in this setting, the only thing that was actually random was the assignment itself: who saw the feature and who did not. The outcomes were not sampled at random from a population; they were observed after a deliberate assignment.</p>
<p>Instead of asking what would happen if I repeatedly sampled new users, I wanted to know what would have happened under different random assignments of the same users. This is the logic of randomisation inference.</p>
<p>I’ve done this before in Stata, where a well-established command, <a href="https://github.com/simonheb/ritest"><code>ritest</code></a>, covers most practical uses of randomisation inference. But I was working in Python. I found tools that cover some uses, but I did not find a functional equivalent to Stata’s <code>ritest</code>.</p>
<p>So I wrote Python’s <code>ritest</code>.</p>
<p>This post is a short announcement. My new package, <a href="https://tabarecapitan.com/projects/ritest/"><code>ritest</code></a>, brings a familiar randomisation inference tool to Python. It is designed to be easy to use, flexible, and fast.</p>
<section id="randomisation-inference" class="level2">
<h2 class="anchored" data-anchor-id="randomisation-inference">Randomisation inference</h2>
<p>When an experiment is randomised, there are two different stories you can tell about uncertainty.</p>
<p>One story is the ‘sampling’ story. You imagine your dataset as one draw from a larger population, and you ask what would happen if you could repeat the data-collection process. That is the story behind most textbook standard errors and t-tests.</p>
<p>The other story is the ‘assignment’ story. You hold the outcomes fixed and ask what would have happened under different random assignments of the same treatment. That is the story behind randomisation inference.</p>
<p>Operationally, randomisation inference is simple:</p>
<ol type="1">
<li>pick a statistic that measures the effect you care about</li>
<li>compute it on the observed assignment</li>
<li>recompute it under many alternative assignments that respect the experimental design</li>
<li>compare the observed statistic to its randomisation distribution</li>
</ol>
<p>That’s it. The hard part, in practice, is doing it in a way that is fast enough to use, and strict enough about the design to be trustworthy.</p>
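<p>The four steps above can be sketched in a few lines of Python. This is illustrative only: the arrays <code>y</code> and <code>treat</code> are hypothetical, it uses unrestricted permutations (no strata or clusters), and it takes the difference in means as the statistic.</p>

```python
import numpy as np

rng = np.random.default_rng(23)

def ri_pvalue(y, treat, reps=5000):
    # Step 1: a statistic that measures the effect (difference in means).
    def stat(t):
        return y[t == 1].mean() - y[t == 0].mean()

    # Step 2: compute it on the observed assignment.
    t_obs = stat(treat)

    # Step 3: recompute it under many alternative assignments.
    draws = np.empty(reps)
    for r in range(reps):
        draws[r] = stat(rng.permutation(treat))

    # Step 4: compare the observed statistic to its randomisation
    # distribution (two-sided tail proportion).
    return (np.abs(draws) >= abs(t_obs)).mean()
```

A real implementation must also respect the design (strata, clusters), which is exactly the ‘strict enough to be trustworthy’ part.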
</section>
<section id="features" class="level2">
<h2 class="anchored" data-anchor-id="features">Features</h2>
<p><code>ritest</code> supports two ways of defining the test statistic. In the most common case, the statistic is a coefficient from a linear model, specified through a regression formula. When that is not appropriate, you can instead provide a custom Python function that maps the data to a single scalar statistic.</p>
<p>In both cases, permutations can be constrained to respect the experimental design, including stratified randomisation, clustered assignment, and optional weighting on the linear path.</p>
<p>By default, <code>ritest</code> makes the Monte Carlo uncertainty in the p-value explicit when permutations are sampled rather than enumerated (which is almost always true). In that case, the p-value itself is an estimate, and the output includes a confidence interval for that estimate. On the linear path, the package also reports coefficient bounds (or a confidence interval) by default.</p>
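<p>To see why a simulated p-value is itself an estimate, note that with <code>R</code> sampled permutations it is a proportion of ‘at least as extreme’ draws. A generic normal-approximation interval for that proportion looks like this (a textbook sketch, not necessarily the exact construction <code>ritest</code> uses internally):</p>

```python
import math

def mc_pvalue_ci(n_extreme, reps, z=1.96):
    # The simulated p-value is a proportion over `reps` Monte Carlo draws,
    # so it carries binomial sampling noise of its own.
    p_hat = n_extreme / reps
    half = z * math.sqrt(p_hat * (1.0 - p_hat) / reps)
    return p_hat, max(0.0, p_hat - half), min(1.0, p_hat + half)
```

With 5,000 permutations and 250 extreme draws, the estimate is 0.05 with a margin of roughly ±0.006, which is why reporting the interval matters near conventional thresholds.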
<p>The package can be installed from PyPI:</p>
<div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1">pip install ritest<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>python</span></code></pre></div>
</section>
<section id="example" class="level2">
<h2 class="anchored" data-anchor-id="example">Example</h2>
<p>Here is a realistic pattern from product work. Imagine a rollout where users are randomised to see a new onboarding flow. The outcome is whether the user activates within 7 days. You also have pre-treatment covariates that help with precision (previous activity, device type, country). The effect you want is the coefficient on <code>treat</code>.</p>
<div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> pandas <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> pd</span>
<span id="cb2-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> ritest <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> ritest</span>
<span id="cb2-3"></span>
<span id="cb2-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Example column meanings:</span></span>
<span id="cb2-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># - activated_7d: 0/1 (activated within 7 days)</span></span>
<span id="cb2-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># - treat: 0/1 (assigned to new onboarding)</span></span>
<span id="cb2-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># - pre_usage: numeric (pre-treatment engagement)</span></span>
<span id="cb2-8"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># - device_ios: 0/1 (pre-built dummy; you can build dummies upstream)</span></span>
<span id="cb2-9"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># - region_eu: 0/1 (pre-built dummy)</span></span>
<span id="cb2-10"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># - strata_id: str/int (block or bucket used in the randomisation)</span></span>
<span id="cb2-11"></span>
<span id="cb2-12">res <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> ritest(</span>
<span id="cb2-13">    df<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>df,</span>
<span id="cb2-14">    permute_var<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"treat"</span>,</span>
<span id="cb2-15">    formula<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"activated_7d ~ treat + pre_usage + device_ios + region_eu"</span>,</span>
<span id="cb2-16">    stat<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"treat"</span>,</span>
<span id="cb2-17">    strata<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"strata_id"</span>,</span>
<span id="cb2-18">    reps<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5000</span>,</span>
<span id="cb2-19">    alpha<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.05</span>,</span>
<span id="cb2-20">    seed<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">23</span>,</span>
<span id="cb2-21">)</span>
<span id="cb2-22"></span>
<span id="cb2-23"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(res.summary())</span></code></pre></div>
<p>This is the workflow I wanted: I can express the estimand as a familiar regression coefficient, and I can get assignment-based uncertainty without pretending the only randomness in the problem is sampling noise.</p>
<p>Now imagine that the adoption question is not your bottleneck. Your bottleneck is latency: you care about the median time-to-value, which is skewed and full of long tails. You still have a randomised assignment, but you do not want to force the problem into a linear model.</p>
<p>That is what the generic path is for.</p>
<div class="sourceCode" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> ritest <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> ritest</span>
<span id="cb3-2"></span>
<span id="cb3-3"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> median_diff(d):</span>
<span id="cb3-4">    treated <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> d.loc[d[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"treat"</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"time_to_value_hours"</span>].median()</span>
<span id="cb3-5">    control <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> d.loc[d[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"treat"</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"time_to_value_hours"</span>].median()</span>
<span id="cb3-6">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> treated <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> control</span>
<span id="cb3-7"></span>
<span id="cb3-8">res <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> ritest(</span>
<span id="cb3-9">    df<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>df,</span>
<span id="cb3-10">    permute_var<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"treat"</span>,</span>
<span id="cb3-11">    stat_fn<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>median_diff,</span>
<span id="cb3-12">    reps<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5000</span>,</span>
<span id="cb3-13">    alpha<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.05</span>,</span>
<span id="cb3-14">    seed<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">23</span>,</span>
<span id="cb3-15">)</span>
<span id="cb3-16"></span>
<span id="cb3-17"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(res.pvalue)</span></code></pre></div>
<p>The point is not that medians are “better” than conditional means. The point is that a real workflow often has both kinds of questions, and the underlying source of uncertainty (the assignment) is the same.</p>
</section>
<section id="conclusion" class="level2">
<h2 class="anchored" data-anchor-id="conclusion">Conclusion</h2>
<p>I built this package because I needed it. The project grew well beyond my original plan as I tried to emulate, in Python, the same sense of convenience I had relied on when doing randomisation inference in Stata. I’m happy with the result, and I hope others find it useful. Since this is my first time releasing a package on PyPI, I genuinely want to hear what people think.</p>
<p>Finally, I want to encourage data scientists, data analysts, and researchers who are not familiar with randomisation inference to take a closer look. Randomisation inference can be appropriate whenever assignment is controlled and known. This is a common setting in many contexts: A/B testing in product and platform experiments, randomised controlled trials in economics and political science, greenhouse and field experiments in agricultural science, and laboratory or clinical studies in life sciences. If the main source of uncertainty in your problem comes from the design itself, randomisation inference may be right for you.</p>


</section>

<a onclick="window.scrollTo(0, 0); return false;" id="quarto-back-to-top"><i class="bi bi-arrow-up"></i> Back to top</a> ]]></description>
  <category>statistics</category>
  <category>software</category>
  <guid>https://www.tabarecapitan.com/blog/0002-ritest-intro/</guid>
  <pubDate>Mon, 05 Jan 2026 23:00:00 GMT</pubDate>
</item>
<item>
  <title>On inference</title>
  <link>https://www.tabarecapitan.com/blog/0001-inference/</link>
  <description><![CDATA[ 






<p>Consider the following hypothetical example. Spotify is investing in audiobooks, and wants to learn how much more discovery it can drive without harming core music listening. An obvious first step is an A/B test: add an ‘Audiobooks’ shelf to the Home feed for <em>some</em> eligible users. After a couple of weeks, estimate the treated minus control difference in time spent listening to books, with a guardrail like time spent listening to music.</p>
<p>If the assignment was random, that difference has a clean causal interpretation <em>for this experiment, for these users, over this window</em>. Identification is straightforward. Inference is more nuanced: how uncertain is the estimate of the difference, and uncertain <em>about what</em>? In other words, what can we infer about the world <em>beyond</em> this particular experiment?</p>
<p>This post is me trying to get the concept of inference straight. I’m going to treat ‘inference’ as a question about the <em>story of what could have happened</em>, not as a set of techniques I can apply.</p>
<section id="inference-is-a-thought-experiment" class="level2">
<h2 class="anchored" data-anchor-id="inference-is-a-thought-experiment">Inference is a thought experiment</h2>
<p>Inference is always a thought experiment. In our example, we get a point estimate for <em>that experiment, for those users, over that window</em>. What if we had another experiment? What if we had other users? What if we had another window? Unfortunately, we cannot see any of that. And so we rely on thought experiments.</p>
<p>Confidence intervals and <img src="https://latex.codecogs.com/png.latex?p">-values answer those ‘what if’ questions within a given thought experiment: What would we see if the world replayed repeatedly, in some relevant sense? That replay is not a minor detail. It is the <em>definition</em> of what your uncertainty statement means. And in most settings there are two replay modes that make immediate sense.</p>
<p><strong>Mode 1: replay the users</strong>. Imagine Spotify could re-run the same experiment many times, but each time the platform happens to see a different slice of users: different people are active, eligible, reachable, or simply online during your windows. You run the same A/B each time, and your estimate moves around <em>because the people changed</em>. This is sampling-based uncertainty.</p>
<p>That story corresponds to the <em>classical statistical inference</em> we typically encounter in textbooks and beyond: <img src="https://latex.codecogs.com/png.latex?p">-values motivated via repeated sampling. The key idea is that your effect (point estimate) could have been different had your sample of users been different. That is the uncertainty you are trying to estimate.</p>
<p><strong>Mode 2: replay the assignment</strong>. Now hold the users fixed. Imagine Spotify could take the same users, same window, and same (potential) outcomes. The only thing you replay is the randomisation: who got the ‘Audiobooks’ shelf and who didn’t, respecting whatever rules you originally used (such as equal split, stratification, blocked randomisation, and so on).</p>
<p>That story is the realm of <em>randomisation inference</em>. It is also the clean way to interpret permutation tests in an experiment: you are not permuting “because it is non-parametric”; you are generating the distribution of your statistic under the assignment mechanism you actually used.</p>
</section>
<section id="quantifying-uncertainty" class="level2">
<h2 class="anchored" data-anchor-id="quantifying-uncertainty">Quantifying uncertainty</h2>
<p>I find it useful to think about inference in three layers. The first layer is the estimator (or statistic): it produces an estimate <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Ctau%7D"> of a target effect <img src="https://latex.codecogs.com/png.latex?%5Ctau">. You need <em>something</em> to make an inference about. Furthermore, random assignment lends causal credibility to the interpretation of the estimate. The second layer relates to the scope of the inference. Are we trying to make inferences about the broader population we want to generalise to? (replay mode 1) Or are we trying to make inferences about who happened to see the ‘Audiobooks’ shelf due to the particular realisation of the randomisation process? (replay mode 2) Or maybe both? The third layer refers to the quantification of the uncertainty <em>within the scope of the inference</em>. It is here that we can find the many methods that take the first two layers and turn them into <img src="https://latex.codecogs.com/png.latex?p">-values and confidence intervals.</p>
<p>For example, in our hypothetical experiment, the first layer is the estimator. We compute the treatment effect estimate <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Ctau%7D">, for instance as the OLS coefficient on the treatment indicator, which (with an intercept) is algebraically equal to the treated–control difference in means, <img src="https://latex.codecogs.com/png.latex?%0A%5Chat%7B%5Ctau%7D%20%5C;=%5C;%20%5Cbar%7BY%7D_%7BT%7D%20-%20%5Cbar%7BY%7D_%7BC%7D.%0A"></p>
<p>Suppose that, in the second layer, we adopt a sampling-based replay story: the estimate <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Ctau%7D"> would have been different had the experiment observed a different random sample of users.</p>
<p>In the third layer, we can quantify that uncertainty in several ways, with the interpretation of each contingent on the sampling-based story.</p>
<p>A <em>standard error</em> is an absolute measure of dispersion. It estimates the variability of the estimator across repeated samples, <img src="https://latex.codecogs.com/png.latex?%0A%5Cwidehat%7BSE%7D(%5Chat%7B%5Ctau%7D)%20%5C;%5Capprox%5C;%20%5Csqrt%7B%5Coperatorname%7BVar%7D(%5Chat%7B%5Ctau%7D)%7D.%0A"></p>
<p>A <em>confidence interval</em> converts the same idea into an absolute uncertainty range for the estimand (the target effect). Under a Normal approximation, a <img src="https://latex.codecogs.com/png.latex?(1-%5Calpha)"> confidence interval for <img src="https://latex.codecogs.com/png.latex?%5Ctau"> is <img src="https://latex.codecogs.com/png.latex?%0A%5Cleft%5B%0A%5Chat%7B%5Ctau%7D%20-%20z_%7B1-%5Calpha/2%7D%5C,%5Cwidehat%7BSE%7D(%5Chat%7B%5Ctau%7D),%0A%5C;%5C;%0A%5Chat%7B%5Ctau%7D%20+%20z_%7B1-%5Calpha/2%7D%5C,%5Cwidehat%7BSE%7D(%5Chat%7B%5Ctau%7D)%0A%5Cright%5D,%0A"> where <img src="https://latex.codecogs.com/png.latex?z_%7B1-%5Calpha/2%7D"> denotes the corresponding quantile of the standard Normal distribution (or the appropriate <img src="https://latex.codecogs.com/png.latex?t"> quantile in finite samples).</p>
<p>A <em><img src="https://latex.codecogs.com/png.latex?p">-value</em> is different in nature: it is defined only relative to a hypothesis. If we wish to assess compatibility with a specific reference value <img src="https://latex.codecogs.com/png.latex?%5Ctau_0"> (often <img src="https://latex.codecogs.com/png.latex?%5Ctau_0%20=%200">), we form the standardised statistic <img src="https://latex.codecogs.com/png.latex?%0At%20%5C;=%5C;%20%5Cfrac%7B%5Chat%7B%5Ctau%7D%20-%20%5Ctau_0%7D%7B%5Cwidehat%7BSE%7D(%5Chat%7B%5Ctau%7D)%7D.%0A"></p>
<p>Under the sampling-based assumptions and the chosen reference distribution, the <img src="https://latex.codecogs.com/png.latex?p">-value is <img src="https://latex.codecogs.com/png.latex?%0Ap%20%5C;=%5C;%20%5CPr%5C!%5Cleft(%20%7CT%7C%20%5Cge%20%7Ct_%7B%5Ctext%7Bobs%7D%7D%7C%20%5C;%5Cmiddle%7C%5C;%20H_0:%5Ctau=%5Ctau_0%20%5Cright),%0A"> that is, the probability—computed under the null hypothesis—that a re-sampled experiment would produce a standardised statistic at least as extreme as the one observed.<sup>1</sup></p>
<p>So far, this may look like ‘methods’: <img src="https://latex.codecogs.com/png.latex?t">-tests, confidence intervals, p-values. But the three layers are the point. None of these outputs make sense in isolation. The meaning comes from (i) the estimator, (ii) the replay story, and only then (iii) the calculator used to turn the story into numbers.</p>
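<p>As a purely illustrative translation of these formulas into code, here is a minimal sketch of all three outputs for the difference in means, under the sampling story, a Welch-style variance estimate, and a Normal approximation:</p>

```python
import math
import numpy as np

def diff_in_means_inference(y1, y0, z=1.96, tau0=0.0):
    # Layer 1: the estimator (treated minus control difference in means).
    tau_hat = y1.mean() - y0.mean()
    # Layer 3, calculator 1: a sampling-based standard error built from
    # treated and control variability and group sizes.
    se = math.sqrt(y1.var(ddof=1) / len(y1) + y0.var(ddof=1) / len(y0))
    # Calculator 2: Normal-approximation confidence interval for tau.
    ci = (tau_hat - z * se, tau_hat + z * se)
    # Calculator 3: two-sided p-value for H0: tau = tau0, via the
    # standardised statistic; erfc(|t|/sqrt(2)) = 2 * (1 - Phi(|t|)).
    t = (tau_hat - tau0) / se
    p = math.erfc(abs(t) / math.sqrt(2.0))
    return tau_hat, se, ci, p
```

The same estimate <code>tau_hat</code> could be paired with a different layer-3 calculator (a t reference distribution, a robust variance estimate) without changing layers one or two.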
</section>
<section id="uncertainty-calculators" class="level2">
<h2 class="anchored" data-anchor-id="uncertainty-calculators">Uncertainty calculators</h2>
<p>Now that the estimator and the replay story are fixed, the remaining question is how to <em>compute</em> uncertainty within that scope. This is where most methods people recognise live. They are mostly different ways of approximating the same object: the distribution of <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Ctau%7D"> under the chosen replay mode.</p>
<p>There are two broad ways to get that distribution.</p>
<p><strong>Route A: estimate variability, then approximate a reference distribution.</strong> This is the standard error route. You compute <img src="https://latex.codecogs.com/png.latex?%5Cwidehat%7BSE%7D(%5Chat%7B%5Ctau%7D)">, form a standardised statistic, and then map it to a <img src="https://latex.codecogs.com/png.latex?p">-value (or CI) using a reference distribution (Normal or <img src="https://latex.codecogs.com/png.latex?t"> in simple cases). This family includes the classic <img src="https://latex.codecogs.com/png.latex?t">-test and its close relatives (Wald tests, <img src="https://latex.codecogs.com/png.latex?F"> tests, <img src="https://latex.codecogs.com/png.latex?%5Cchi%5E2"> tests), all of which share the same structure: <img src="https://latex.codecogs.com/png.latex?%0A%5Ctext%7Bstatistic%7D%20%5Cquad%20%5Crightarrow%20%5Cquad%20%5Ctext%7Bestimated%20variability%7D%20%5Cquad%20%5Crightarrow%20%5Cquad%20%5Ctext%7Breference%20distribution%7D.%0A"></p>
<p>Within this route, you still have choices about the variability estimate. In regression output, for instance, a <em>model-based</em> OLS standard error is tied to a particular noise model, while a <em>robust (sandwich)</em> standard error is designed to be less dependent on that model. The estimator <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Ctau%7D"> can be identical, while the attached uncertainty calculation changes because the calculator changed. That is why Freedman’s (2008) warning matters even in experiments: randomisation can justify the causal meaning of <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Ctau%7D">, while leaving room for disagreement (or mistakes) about the standard error attached to it.</p>
<p><strong>Route B: build a reference distribution directly by replaying the world.</strong> This is the resampling or re-randomisation route. Instead of estimating an <img src="https://latex.codecogs.com/png.latex?SE"> and leaning on a Normal or <img src="https://latex.codecogs.com/png.latex?t"> approximation, you generate many ‘replays’ and recompute the statistic each time. The output is an empirical distribution of <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Ctau%7D">, from which you can read off uncertainty summaries.</p>
<p>Two big families sit here:</p>
<ul>
<li><p><strong>Bootstrap and jackknife (sampling replay):</strong> you replay <em>which users you observed</em> by resampling units (bootstrap) or systematically leaving them out (jackknife). You can then compute <img src="https://latex.codecogs.com/png.latex?%5Cwidehat%7BSE%7D(%5Chat%7B%5Ctau%7D)"> as the standard deviation of the replicated estimates, or compute confidence intervals from quantiles of the empirical distribution. A <img src="https://latex.codecogs.com/png.latex?p">-value is also possible, but it requires an explicit hypothesis construction, just like before.</p></li>
<li><p><strong>Randomisation inference (assignment replay):</strong> you replay <em>who was treated</em> by re-running the randomisation procedure many times, respecting the original design. Under a sharp null, this directly gives a reference distribution for your statistic under the assignment mechanism.<sup>2</sup> A <img src="https://latex.codecogs.com/png.latex?p">-value can be computed with the most literal tail probability: <img src="https://latex.codecogs.com/png.latex?%0A%5Chat%7Bp%7D%20=%5C;%20%5Cfrac%7B%5C#%5C%7B%7CT_r%7C%20%5Cge%20%7CT_%7B%5Ctext%7Bobs%7D%7D%7C%5C%7D%7D%7BR%7D,%0A"> where <img src="https://latex.codecogs.com/png.latex?R"> is the number of simulated reassignments and <img src="https://latex.codecogs.com/png.latex?T_r"> is the statistic under reassignment <img src="https://latex.codecogs.com/png.latex?r">. Notice what is missing: there is no required step of ‘estimate an <img src="https://latex.codecogs.com/png.latex?SE"> and assume Normality’. The design supplies the reference distribution.<sup>3</sup></p></li>
</ul>
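<p>Both replays can be sketched in a few lines of NumPy. The data below are simulated purely for illustration (the effect size, sample size, and replication counts are hypothetical, not from any real experiment):</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative A/B data: y is the outcome, z marks treated users
# in a 50/50 complete randomisation.
n = 1_000
z = rng.permutation(np.repeat([0, 1], n // 2))
y = 0.3 * z + rng.normal(size=n)

def diff_in_means(y, z):
    return y[z == 1].mean() - y[z == 0].mean()

t_obs = diff_in_means(y, z)

# Sampling replay (bootstrap): resample *which users you observed*,
# then read the SE off the replicated estimates.
B = 2_000
boot = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, size=n)   # resample units with replacement
    boot[b] = diff_in_means(y[idx], z[idx])
se_boot = boot.std(ddof=1)

# Assignment replay (randomisation inference): re-run the design by
# permuting *who was treated*, and take the literal tail probability.
R = 2_000
t_perm = np.empty(R)
for r in range(R):
    t_perm[r] = diff_in_means(y, rng.permutation(z))
p_hat = np.mean(np.abs(t_perm) >= np.abs(t_obs))
```

<p>Note how the two loops differ only in <em>what</em> gets replayed: the bootstrap resamples rows (which users you observed), while the randomisation loop permutes the treatment labels (who was treated), holding the users fixed.</p>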
</section>
<section id="practical-implications" class="level2">
<h2 class="anchored" data-anchor-id="practical-implications">Practical implications</h2>
<p>You may have sensed a tension by this point. In our Spotify example, the effect estimate is justified by random assignment (design logic), while much of standard inference is presented through a sampling lens. And yet, in plain A/B tests, robust <img src="https://latex.codecogs.com/png.latex?t">-tests, bootstrap uncertainty, and randomisation-based checks often tell the same story.</p>
<p>This is not because the layers collapse into one. It is because the situation is unusually ‘friendly’:</p>
<ul>
<li><strong>The estimator is simple.</strong> A difference in means is a stable object.</li>
<li><strong>Sample sizes are large.</strong> Many distributions become well-behaved once you have enough users, and many reasonable standardisations start to look similar.</li>
<li><strong>Different variance calculators converge.</strong> In the binary-treatment case, several common standard-error formulas are built from the same ingredients (treated and control variability and group sizes), so their numerical differences can get washed out.</li>
</ul>
<p>If all you need is a quick answer to ‘did the shelf move audiobook listening?’, this is why your preferred software’s default often feels like it ‘just works’.</p>
<p>But the friendly zone is not guaranteed. The moment the design stops being ‘randomise users 50/50’, the replay world changes, and the calculator has to match it.</p>
<p>For example, if you randomised within strata (country, device, prior engagement), then ‘replay the assignment’ means reshuffling <em>within strata</em>. A calculator that ignores this is quantifying uncertainty for a world that never could have happened. Alternatively, if assignment happens at a higher level (households, classrooms, markets), the effective sample size is the number of clusters, not the number of users. Many default approximations become fragile when there are few clusters.</p>
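<p>As a concrete sketch of the stratified case, ‘replay the assignment’ becomes ‘permute treatment labels within each stratum’. The strata sizes and variable names below are made up for illustration:</p>

```python
import numpy as np

rng = np.random.default_rng(1)

def permute_within_strata(z, strata, rng):
    """Replay a stratified assignment: shuffle treatment labels
    only inside each stratum, never across strata."""
    z_new = np.empty_like(z)
    for s in np.unique(strata):
        mask = strata == s
        z_new[mask] = rng.permutation(z[mask])
    return z_new

# Illustrative design: three countries, randomised 50/50 within each.
sizes = (40, 60, 100)
strata = np.repeat([0, 1, 2], sizes)
z = np.concatenate([rng.permutation(np.repeat([0, 1], k // 2))
                    for k in sizes])

z_perm = permute_within_strata(z, strata, rng)

# Each stratum keeps its exact treated count, as the design requires;
# a naive global shuffle would not guarantee this.
for s in (0, 1, 2):
    assert z_perm[strata == s].sum() == z[strata == s].sum()
```

<p>A global <code>rng.permutation(z)</code> would mix labels across countries and quantify uncertainty for assignments the design could never have produced, which is exactly the mismatch described above.</p>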
<p>This is the practical take of the three layers: inference is not a button you press after you get <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Ctau%7D">. It is the combination of (i) what you estimated, (ii) what you think could have happened, and (iii) how you chose to quantify that.</p>
</section>
<section id="reading-list" class="level2">
<h2 class="anchored" data-anchor-id="reading-list">Reading list</h2>
<p>The references below have been helpful to me; they are not meant as a comprehensive survey.</p>
<p>⭐ Abadie, A., Athey, S., Imbens, G., and Wooldridge, J. 2020. “Sampling-Based versus Design-Based Uncertainty in Regression Analysis.” <em>Econometrica</em>. <a href="https://economics.mit.edu/sites/default/files/publications/ECTA12675.pdf">link</a></p>
<p>Athey, S., and Imbens, G. W. 2017. “The Econometrics of Randomized Experiments.” In <em>Handbook of Economic Field Experiments</em>. North-Holland. <a href="https://arxiv.org/abs/1607.00698">link</a></p>
<p>Freedman, David A. 2008. “On Regression Adjustments to Experimental Data.” <em>Advances in Applied Mathematics</em>. <a href="https://www.stat.berkeley.edu/~census/neyregr.pdf">link</a></p>
<p>Imbens, G. W., and Rubin, D. B. 2015. <em>Causal Inference in Statistics, Social, and Biomedical Sciences</em>. Cambridge University Press. <a href="https://books.google.se/books?id=Bf1tBwAAQBAJ">link</a></p>
<p>Spotify. 2025. “How Spotify Is Driving Growth, Discovery, and Innovation in the Audiobook Market.” <em>Spotify Newsroom</em>, March 13. <a href="https://newsroom.spotify.com/2025-03-13/how-spotify-is-driving-growth-discovery-and-innovation-in-the-audiobook-market/">link</a></p>


</section>


<div id="quarto-appendix" class="default"><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p>Note that inference can exist without hypothesis testing; testing is a decision layered on top of an uncertainty statement, not its foundation.↩︎</p></li>
<li id="fn2"><p>Without additional structure, this exactness is tied to sharp nulls; a null about an average effect does not by itself pin down the missing potential outcomes. You can read more about randomisation inference under weak nulls (such as nulls about the ATE) in <a href="https://arxiv.org/abs/1809.07419">this paper</a>, published <a href="https://www.tandfonline.com/doi/abs/10.1080/01621459.2020.1750415?casa_token=hMXx_Nl6vIMAAAAA:DMVLAKErKTsFGYb898Sr5wHC2Uxtt2_JIJ-ATIsahRreYayFZ_VaYPF6Q6dJpXsNc8newgsWgp-Bgg">in JASA</a>.↩︎</p></li>
<li id="fn3"><p>If you approximate a randomisation <img src="https://latex.codecogs.com/png.latex?p">-value by simulation, then <img src="https://latex.codecogs.com/png.latex?%5Chat%7Bp%7D"> has Monte Carlo error because <img src="https://latex.codecogs.com/png.latex?R"> is finite. If <img src="https://latex.codecogs.com/png.latex?c"> is the number of simulated statistics at least as extreme as <img src="https://latex.codecogs.com/png.latex?T_%7B%5Ctext%7Bobs%7D%7D">, then <img src="https://latex.codecogs.com/png.latex?%5Chat%7Bp%7D=c/R">. Treating <img src="https://latex.codecogs.com/png.latex?c%20%5Csim%20%5Coperatorname%7BBinomial%7D(R,p)"> gives a simple way to compute a confidence interval for <img src="https://latex.codecogs.com/png.latex?p"> (for example via a Clopper–Pearson or Wilson interval). This is uncertainty about the Monte Carlo approximation, not uncertainty about the treatment effect.↩︎</p></li>
</ol>
</section></div> ]]></description>
  <category>statistics</category>
  <guid>https://www.tabarecapitan.com/blog/0001-inference/</guid>
  <pubDate>Sun, 28 Dec 2025 23:00:00 GMT</pubDate>
</item>
</channel>
</rss>
