Linear vs generic


The Colombian example showed that ritest in Python works and is reasonably fast. That, however, is a benchmark for linear models with multiple fixed effects. In this section I benchmark a simple OLS specification against the Stata and R implementations of randomization inference.

Data

The simulated dataset represents a clean, randomized experiment with \(N=10{,}000\) observations, designed to resemble realistic applied data. Treatment is randomly assigned with equal probability (50/50), with no clustering and no stratification. The dataset includes 20 covariates covering demographics, behavior, engagement, and region indicators. These covariates are generated from a small number of latent factors, inducing realistic correlations while remaining fully observed and free of missing values.

The outcome is continuous and driven by covariates plus a very small heterogeneous treatment effect. The individual treatment effect is

\[\tau_i = 0.02 + 0.0075 \cdot h_i,\]

where \(h_i\) is a standardized index based on baseline spending and satisfaction, so the true average treatment effect is approximately 0.02. Outcome noise is Gaussian with standard deviation \(5\), chosen so that at \(N = 10{,}000\) the problem is nontrivial but stable, with an expected standard error for the treatment coefficient of about \(0.1\) (under a 50/50 split, \(\sigma\sqrt{1/n_1 + 1/n_0} = 5\sqrt{2/5000} = 0.1\)). This implies considerable design-based variability, so point estimates and \(p\)-values can differ noticeably across realizations and across implementations. In addition, randomization inference relies on a finite number of permutations, which introduces a small amount of Monte Carlo error. On average, however, inference is expected to provide little evidence against the null hypothesis of no treatment effect.
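As a condensed sketch of this design (not the exact generating script: the 20 covariates are collapsed into the heterogeneity index \(h_i\), and the variable names are illustrative), the simulation amounts to:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(23)
n = 10_000

# 50/50 randomized treatment assignment, no clustering, no stratification
treat = rng.permutation(np.repeat([0, 1], n // 2))

# h_i: standardized heterogeneity index (condensed stand-in for the
# baseline-spend and satisfaction index described above)
h = rng.standard_normal(n)

# Individual effect tau_i = 0.02 + 0.0075 * h_i; Gaussian noise with sd 5
tau = 0.02 + 0.0075 * h
y = tau * treat + rng.normal(scale=5.0, size=n)

df = pd.DataFrame({"y": y, "treat": treat, "h": h})
```

With noise this large relative to the true effect (0.02 against an SE of roughly 0.1), a single realization will rarely reject the null.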

Stata

Set a local macro with all covariates except the permute variable treat:

local xvars ///
    age female education_years log_income household_size urban      ///
    tenure_months baseline_spend purchases_12m returns_12m          ///
    support_tickets_6m app_sessions_30d days_since_last_purchase    ///
    email_opt_in promo_exposure_30d prior_churn credit_score        ///
    satisfaction_score region_1 region_2 region_3 region_4

Run ritest:

ritest treat _b[treat], reps(2000) seed(23) nodots: reg y treat `xvars'
. ritest treat _b[treat], reps(2000) seed(23) nodots: reg y treat `xvars'

      Source |       SS           df       MS      Number of obs   =    10,000
-------------+----------------------------------   F(23, 9976)     =    701.81
       Model |  389011.799        23  16913.5565   Prob > F        =    0.0000
    Residual |  240422.079     9,976   24.100048   R-squared       =    0.6180
-------------+----------------------------------   Adj R-squared   =    0.6172
       Total |  629433.877     9,999  62.9496827   Root MSE        =    4.9092

------------------------------------------------------------------------------
           y | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
       treat |   .0819918   .0983118     0.83   0.404    -.1107191    .2747028
       ... (clipped for brevity)
------------------------------------------------------------------------------

      Command: regress y treat age female education_years log_income
                   household_size urban tenure_months baseline_spend
                   purchases_12m returns_12m support_tickets_6m
                   app_sessions_30d days_since_last_purchase email_opt_in
                   promo_exposure_30d prior_churn credit_score
                   satisfaction_score region_1 region_2 region_3 region_4
        _pm_1: _b[treat]
  res. var(s):  treat
   Resampling:  Permuting treat
Clust. var(s):  __000001
     Clusters:  10000
Strata var(s):  none
       Strata:  1

------------------------------------------------------------------------------
T            |     T(obs)       c       n   p=c/n   SE(p) [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _pm_1 |   .0819918     786    2000  0.3930  0.0109  .3715147   .4147996
------------------------------------------------------------------------------
Note: Confidence interval is with respect to p=c/n.
Note: c = #{|T| >= |T(obs)|}

183 seconds, or about 3 minutes.

R

Estimate a linear model, save in est:

fml <- y ~ treat +
  age + female + education_years + log_income + household_size + urban + tenure_months +
  baseline_spend + purchases_12m + returns_12m + support_tickets_6m + app_sessions_30d +
  days_since_last_purchase + email_opt_in + promo_exposure_30d + prior_churn + credit_score +
  satisfaction_score + region_1 + region_2 + region_3 + region_4

est <- lm(fml, data = df)

Run ritest, save results in ri:

ri <- ritest(est, "treat", reps = 2000, seed = 23L)

Print results:

print(ri)
OLS (lm) fit:

Call:
lm(formula = fml, data = df)

Residuals:
     Min       1Q   Median       3Q      Max
-19.0734  -3.3147   0.0069   3.3346  19.3317

Coefficients:
                           Estimate Std. Error t value Pr(>|t|)
(Intercept)              -2.008e+01  1.414e+00 -14.205  < 2e-16 ***
treat                     8.199e-02  9.831e-02   0.834 0.404302
... (clipped for brevity)
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.909 on 9976 degrees of freedom
Multiple R-squared:  0.618, Adjusted R-squared:  0.6172
F-statistic: 701.8 on 23 and 9976 DF,  p-value: < 2.2e-16


RI result:

          Call: lm(formula = fml, data = df)
   Res. var(s): treat
            H0: treat=0
     Num. reps: 2000
────────────────────────────────────────────────────────────────────────────────
  T(obs)         c         n     p=c/n     SE(p)   CI 2.5%  CI 97.5%
 0.08199       839      2000    0.4195   0.01815    0.3896    0.4494
────────────────────────────────────────────────────────────────────────────────
Note: Confidence interval is with respect to p=c/n.
Note: c = #{|T| >= |T(obs)|}

11.24 seconds.

Python

Set up the model:

PERMUTE_VAR = "treat"
STAT = "treat"

COVARS = [
    "age",
    "female",
    "education_years",
    "log_income",
    "household_size",
    "urban",
    "tenure_months",
    "baseline_spend",
    "purchases_12m",
    "returns_12m",
    "support_tickets_6m",
    "app_sessions_30d",
    "days_since_last_purchase",
    "email_opt_in",
    "promo_exposure_30d",
    "prior_churn",
    "credit_score",
    "satisfaction_score",
    "region_1",
    "region_2",
    "region_3",
    "region_4",
]

FORMULA = "y ~ treat + " + " + ".join(COVARS)

Run ritest:

res = ritest(
    df=df,
    permute_var=PERMUTE_VAR,
    formula=FORMULA,
    stat=STAT,
    reps=2000,
    seed=23,
)

Print results:

print(res)
Randomization Inference Result
===============================

Coefficient
-----------
Observed effect (β̂):   0.0820
Coefficient CI bounds: [-0.1133, 0.2778]
Coefficient CI band:   not computed

Permutation test
----------------
Tail (alternative):     two-sided
p-value:                0.3990 (39.9%)
P-value CI @ α=0.050: [0.3775, 0.4208]
As-or-more extreme:     798 / 2000

Test configuration
------------------
Stratified:             no
Clustered:              no
Weights:                no

Settings
--------
alpha:                  0.050
seed:                   23
ci_method:              cp
ci_mode:                bounds
n_jobs:                 4

3.274 seconds.
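The reported p-value interval can be checked by hand. Assuming `ci_method: cp` refers to the standard Clopper-Pearson (exact binomial) construction, the interval for \(p = c/n = 798/2000\) can be reproduced with `scipy.stats.beta`:

```python
from scipy.stats import beta

c, n, alpha = 798, 2000, 0.05

# Permutation p-value: share of draws at least as extreme as T(obs)
p = c / n  # 0.3990, as printed

# Clopper-Pearson (exact binomial) interval for p = c/n
lo = beta.ppf(alpha / 2, c, n - c + 1)
hi = beta.ppf(1 - alpha / 2, c + 1, n - c)

print(f"p = {p:.4f}, CI = [{lo:.4f}, {hi:.4f}]")
```

This agrees with the printed interval of [0.3775, 0.4208] up to rounding.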

Discussion

The randomization inference results align across the three implementations. In terms of performance, the Python implementation is fastest (3.274 s), followed by R (11.24 s) and Stata (183 s). In the Colombian example the R implementation was faster because of its explicit integration with fixest for estimation with multiple fixed effects; here the Python implementation wins because of its fast engine specialized for OLS.
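Conceptually, all three implementations run the same loop: re-estimate the treatment coefficient under random reassignments of `treat` and count how often the permuted statistic is at least as extreme as the observed one. A minimal sketch on toy data, using plain numpy least squares rather than any of the packages:

```python
import numpy as np

rng = np.random.default_rng(23)
n, reps = 1_000, 500

# Toy data: one covariate, no true treatment effect
x = rng.standard_normal(n)
treat = rng.permutation(np.repeat([0, 1], n // 2))
y = 0.5 * x + rng.normal(scale=5.0, size=n)

def treat_coef(t):
    # OLS of y on [1, t, x]; return the coefficient on t
    X = np.column_stack([np.ones(n), t, x])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

t_obs = treat_coef(treat)

# Permute treatment, re-estimate, count as-or-more-extreme draws
perm = np.array([treat_coef(rng.permutation(treat)) for _ in range(reps)])
c = int(np.sum(np.abs(perm) >= np.abs(t_obs)))
p_value = c / reps
```

The packages differ mainly in how cheaply they re-estimate the coefficient inside this loop, which is where the timing gaps come from.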
