Students with more prior knowledge (high L) attend the tutorial more often (L→A) and achieve better grades anyway (L→Y). This creates a spurious correlation: tutorial participants perform better — but not only because of the tutorial, but because they already knew more.
Why does the naive group comparison mislead? Students who attend the tutorial (A=1) already have more prior knowledge on average — part of their better grades would have happened anyway. The naive mean comparison E[Y|A=1] − E[Y|A=0] therefore mixes the true tutorial effect with the prior-knowledge advantage: Naive estimate = Causal effect + Confounding bias. Only in a randomised experiment (RCT) would the two groups be comparable — in observational studies we need G-Computation.
G-Computation asks for each person the counterfactual question: “What would the grade be with the same prior knowledge — once with, once without tutorial?” The averaged difference is the causal effect — adjusted for prior knowledge.
—
DGP of this tool.
All data are simulated according to the following data-generating process:
L ~ Normal(0,1) — confounder (in the running example: prior knowledge)
A | L ~ Bernoulli(σ(γ_A·L)) — treatment depends on L (via γ_A)
Y | A, L ~ Normal((θ + ξ·L)·A + γ_Y·L, 1) — outcome
θ is the true causal effect of A on Y. γ_A controls how strongly L drives treatment selection; γ_Y controls how strongly L affects the outcome. In real data θ, γ_A and γ_Y are unknown; here we can set all three.
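The DGP above can be sketched numerically. The following is a minimal Python/numpy simulation, not the tool's own R code; the parameter values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate(n=200_000, theta=1.0, xi=0.0, gamma_A=1.0, gamma_Y=1.0):
    """Draw (L, A, Y) according to the tool's DGP."""
    L = rng.normal(0.0, 1.0, n)                           # confounder
    A = rng.binomial(1, 1 / (1 + np.exp(-gamma_A * L)))   # A | L ~ Bern(sigma(gamma_A*L))
    Y = (theta + xi * L) * A + gamma_Y * L + rng.normal(0.0, 1.0, n)
    return L, A, Y

L, A, Y = simulate()
naive = Y[A == 1].mean() - Y[A == 0].mean()
print(f"true theta = 1.0, naive mean difference = {naive:.2f}")  # clearly above 1
```

With γ_A = γ_Y = 1 the naive mean difference lands well above the true θ = 1: exactly the confounding bias the text describes.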
G-Computation: fully Bayesian.
The outcome model Y ~ A * L is fit in a Bayesian framework: flat priors, then K samples from the posterior. For each sample we compute for each person "What would Y be under A=1 vs. A=0?" and average the difference across the target population. The result is a proper posterior distribution of the estimand.
Why different populations for ATE, ATT, ATU?
The outcome model applies to everyone, but the question of for whom we average defines the estimand:
— ATE: Ŷ(1)−Ŷ(0) averaged over all n persons → "What would the effect be in the total population?" In code: avg_comparisons(fit_gc, variables = "A")
— ATT: only the n₁ actually treated (A=1) → "Did the treatment help those who received it?" Their L-values are systematically higher (since treatment correlates with L). In code: newdata = subset(dat, A==1)
— ATU: only the n₀ untreated (A=0) → "What would happen if we extended treatment to them?" In code: newdata = subset(dat, A==0)
Without effect heterogeneity (ξ=0) all three values are equal. With ξ≠0 they diverge — because the subpopulations have different L-distributions and the effect depends on L.
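All three estimands can be read off a single fitted outcome model. Below is a compact plug-in sketch in Python, using numpy least squares instead of the tool's Bayesian R fit; θ = 1 and ξ = 0.5 are assumed illustrative values:

```python
import numpy as np

rng = np.random.default_rng(7)
n, theta, xi, gamma_A, gamma_Y = 200_000, 1.0, 0.5, 1.0, 1.0

L = rng.normal(size=n)
A = rng.binomial(1, 1 / (1 + np.exp(-gamma_A * L)))
Y = (theta + xi * L) * A + gamma_Y * L + rng.normal(size=n)

# Outcome model Y ~ A * L, i.e. design matrix [1, A, L, A*L]
X = np.column_stack([np.ones(n), A, L, A * L])
beta, *_ = np.linalg.lstsq(X, Y, rcond=None)

def predict(a, l):
    return beta[0] + beta[1] * a + beta[2] * l + beta[3] * a * l

diff = predict(1, L) - predict(0, L)   # predicted Y(1) - Y(0), per person
ate = diff.mean()                      # averaged over everyone
att = diff[A == 1].mean()              # only over the treated
atu = diff[A == 0].mean()              # only over the untreated
print(f"ATE={ate:.2f}  ATT={att:.2f}  ATU={atu:.2f}")  # ATT > ATE > ATU
```

Same fit, three averages: only the population over which the predicted differences are averaged changes.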
Browser approximation.
For illustration this tool uses Normal-Normal conjugacy: with a Normal likelihood and Normal prior the posterior is analytically tractable, no MCMC needed. The R code shows brms with full MCMC — which is the recommended approach for your own analyses.
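The conjugate shortcut can be sketched as a Bayesian linear regression with a flat prior and the noise variance fixed at 1, whose coefficient posterior is exactly Normal(β̂, (XᵀX)⁻¹). This is a simplified stand-in for the browser code, not the tool's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(3)
n, K = 5_000, 2_000                     # sample size, posterior draws

L = rng.normal(size=n)
A = rng.binomial(1, 1 / (1 + np.exp(-L)))
Y = 1.0 * A + 1.0 * L + rng.normal(size=n)   # theta = 1, xi = 0

X = np.column_stack([np.ones(n), A, L, A * L])
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y
# Flat prior + known sigma^2 = 1  =>  posterior is Normal(beta_hat, (X'X)^-1)
draws = rng.multivariate_normal(beta_hat, XtX_inv, size=K)

# G-computation per draw; in the linear model this reduces to b_A + b_AL * mean(L)
ate_draws = draws[:, 1] + draws[:, 3] * L.mean()
lo, hi = np.quantile(ate_draws, [0.025, 0.975])
print(f"posterior mean ATE = {ate_draws.mean():.2f}, 95% CrI = [{lo:.2f}, {hi:.2f}]")
```

Each posterior draw yields one ATE value, so the result is a full posterior distribution of the estimand, no MCMC required.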
Effect heterogeneity (ξ).
Without heterogeneity (ξ=0): Y = θ·A + γ_Y·L + ε — the causal effect θ is the same for everyone, regardless of L. ATE = ATT = ATU = θ, and in the effect plot (4th panel) all points lie on a horizontal line.
With ξ=0.5: Y = (θ + 0.5·L)·A + γ_Y·L + ε — persons with higher L benefit more. A positive slope appears in the effect plot. Since with positive confounding the treated have higher L-values on average (they were more often treated): ATT > ATE > ATU. Same data, same method, three different correct answers — because three different population questions are asked.
Overlap (positivity assumption).
Overlap means: for every observed L-value there are both treated and untreated persons — 0 < P(A=1|L=l) < 1. This is visible in the first scatter plot: with good overlap, orange (A=1) and blue (A=0) points mix across the entire L-range. In the "No overlap" scenario the groups are almost completely separated — persons with high L are almost always treated.
This is a problem because G-Computation then extrapolates the outcome model into L-regions where no (or very few) controls were observed. The estimate becomes highly model-dependent — visible as a wider posterior and greater uncertainty.
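The breakdown of positivity is easy to provoke in simulation. A small Python check, with γ_A values chosen to mimic the two scenarios (thresholds are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000
L = rng.normal(size=n)

shares = {}
for gamma_A in (1.0, 8.0):                 # moderate vs. extreme selection
    A = rng.binomial(1, 1 / (1 + np.exp(-gamma_A * L)))
    high = L > 1                           # region of high L
    shares[gamma_A] = 1 - A[high].mean()   # share of untreated persons there

print(f"untreated share at L>1: gamma_A=1 -> {shares[1.0]:.3f}, "
      f"gamma_A=8 -> {shares[8.0]:.4f}")   # near zero: no controls to learn from
```

With γ_A = 8 there are essentially no untreated persons at high L, so any prediction of Y(0) in that region is pure model extrapolation.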
When is G-Computation essential — and when does the regression coefficient suffice?
In the linear model without interaction (Y ~ A + L) the G-Computation ATE equals exactly the regression coefficient β̂_A: linearity ensures that conditional and marginal effects coincide (without confounding, G-Computation is even identical to a t-test).
In the linear model with interaction (Y ~ A + L + A:L) this no longer holds in general: β̂_A is only the effect at L=0, while G-Computation correctly averages over the actual L-distribution of the target population (ATE = β̂_A + β̂_{AL}·Ē[L]). Moreover, G-Computation computes ATT and ATU by restricting to the respective subpopulation — this is not possible with the coefficient alone.
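The formula ATE = β̂_A + β̂_{AL}·Ē[L] can be checked numerically. A Python sketch, in which L is shifted to mean 0.5 so that the coefficient and the g-computation ATE visibly diverge (parameter values are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100_000
L = rng.normal(loc=0.5, size=n)          # mean(L) = 0.5, so L = 0 is not "typical"
A = rng.binomial(1, 1 / (1 + np.exp(-L)))
Y = (1.0 + 0.5 * L) * A + L + rng.normal(size=n)   # theta = 1, xi = 0.5

X = np.column_stack([np.ones(n), A, L, A * L])
b, *_ = np.linalg.lstsq(X, Y, rcond=None)

coef_A = b[1]                            # conditional effect at L = 0 only
gcomp_ate = b[1] + b[3] * L.mean()       # marginal effect, averaged over L
print(f"beta_A = {coef_A:.2f}  vs.  g-computation ATE = {gcomp_ate:.2f}")
```

The true ATE here is 1 + 0.5·0.5 = 1.25, which g-computation recovers, while β̂_A stays at the L=0 value of 1.0.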
Formula choice: Y ~ A + L versus Y ~ A * L.
Standard DAGs encode causal structure (which variable influences which) — but not the functional form of those relationships. Whether the effect of A on Y differs across levels of L (effect heterogeneity, A:L interaction) is a substantive claim that goes beyond the graph: “Is it plausible that L moderates the effect of A?”
The tool therefore always specifies the outcome model as Y ~ A * L — a conservative default that allows heterogeneity without assuming it. Whether A:L is actually needed can be tested empirically with loo_compare(loo(fit_gc), loo(fit_add)) (see R code). If evidence for the interaction is lacking, Y ~ A + L is the more parsimonious model. The Golem Builder helps decide whether a moderation relationship is theoretically justified in the DAG.
G-Computation is essential for GLMs with non-linear link functions (e.g. logit for logistic regression, log for Poisson): there the coefficient is a conditional effect on the link scale (e.g. log-odds ratio). Due to the non-collapsibility of the logit link, conditional and marginal effects differ fundamentally — even without confounding. G-Computation delivers the marginal effect on the response scale (e.g. risk difference in probability points), which the coefficient alone cannot provide. avg_comparisons() in marginaleffects does exactly that, model-class-agnostically.
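Non-collapsibility can be seen without fitting any model: with P(Y=1|A,L) = σ(θ·A + L) and L independent of A (so no confounding at all), the conditional odds ratio is e^θ at every L, yet the marginal odds ratio is smaller. A Monte Carlo check in Python (this logistic model is an illustrative assumption, not the tool's DGP):

```python
import numpy as np

rng = np.random.default_rng(4)
theta = 1.0
L = rng.normal(size=1_000_000)          # independent of A: no confounding

def sigma(x):
    return 1 / (1 + np.exp(-x))

p1 = sigma(theta + L).mean()            # marginal P(Y=1) under A=1
p0 = sigma(L).mean()                    # marginal P(Y=1) under A=0
marginal_or = (p1 / (1 - p1)) / (p0 / (1 - p0))

print(f"conditional OR = exp(theta) = {np.exp(theta):.2f}")
print(f"marginal OR = {marginal_or:.2f}  (smaller: non-collapsibility)")
print(f"marginal risk difference = {p1 - p0:.3f}")  # what g-computation reports
```

Averaging probabilities and then forming the odds ratio is not the same as averaging the (constant) conditional odds ratio; the risk difference on the probability scale is what g-computation delivers.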
Extensive examples of G-Computation with brms for various GLMs (logistic, Poisson, multinomial, etc.) can be found on the blog of Solomon Kurz: Boost your power with baseline covariates.
OVB formula — why the naive estimator is biased.
The naive regression coefficient β̂_naive from Y ~ A equals (in the population):
β̂_naive = θ + γ_Y · Cov(L, A) / Var(A)
= θ (true effect) + γ_Y · δ_{L∼A} (confounding bias)
δ_{L∼A} = Cov(L,A)/Var(A) is the slope coefficient from the auxiliary regression L ~ A: for binary A it equals the mean difference in L between treated and untreated, i.e. how strongly treatment assignment is tied to the confounder. The product γ_Y · δ_{L∼A} is the Omitted Variable Bias: it vanishes exactly when either L does not affect the outcome (γ_Y = 0) or L is independent of A (δ_{L∼A} = 0, which γ_A controls).
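The OVB identity can be verified by simulation. A Python sketch (parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(11)
n, theta, gamma_A, gamma_Y = 200_000, 1.0, 1.0, 1.0

L = rng.normal(size=n)
A = rng.binomial(1, 1 / (1 + np.exp(-gamma_A * L)))
Y = theta * A + gamma_Y * L + rng.normal(size=n)

def slope(x, y):
    """OLS slope of y ~ x."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

naive = slope(A, Y)            # naive coefficient from Y ~ A
delta = slope(A, L)            # delta from the auxiliary regression L ~ A
print(f"naive = {naive:.3f}  vs.  theta + gamma_Y * delta = {theta + gamma_Y * delta:.3f}")
```

The two numbers agree up to sampling noise: the naive coefficient is the true effect plus γ_Y times the baseline imbalance in L.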
Sign of the bias. In this tool γ_A (L→A) and γ_Y (L→Y) can be set independently, yielding the following sign structure:
| L→A | L→Y | Bias = (L→Y)×(L→A) | Naive vs. True |
|---|---|---|---|
| + | + | + (overestimation) | Naive larger |
| − | − | + (−×− = +) | Naive larger (often surprising!) |
| + | − | − (underestimation) | Naive smaller |
| − | + | − (underestimation) | Naive smaller (wrong sign possible!) |
Rule of thumb: the bias is positive when both paths have the same sign (both + or both −) and negative when the signs differ.
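The sign table can be reproduced directly. A Python check with ±1 for both paths (magnitudes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)

def naive_bias(gamma_A, gamma_Y, n=200_000, theta=1.0):
    """Bias of the naive mean difference under the tool's DGP (xi = 0)."""
    L = rng.normal(size=n)
    A = rng.binomial(1, 1 / (1 + np.exp(-gamma_A * L)))
    Y = theta * A + gamma_Y * L + rng.normal(size=n)
    return (Y[A == 1].mean() - Y[A == 0].mean()) - theta

biases = {(gA, gY): naive_bias(gA, gY)
          for gA in (1, -1) for gY in (1, -1)}
for (gA, gY), b in biases.items():
    print(f"gamma_A={gA:+d}  gamma_Y={gY:+d}  ->  bias = {b:+.2f}")
```

Matching the table: same signs give positive bias, opposite signs give negative bias.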
Heckman decomposition: Naive comparison = ATE + Selection Bias + HTEB.
The naive comparison E[Y(1)|A=1] − E[Y(0)|A=0] generally does not equal the ATE (Cunningham, 2021; Morgan & Winship, 2007). The ATE is first a weighted sum of ATT and ATU (with π = P(A=1)):
ATE = π · ATT + (1−π) · ATU
After algebraic rearrangement (add and subtract E[Y(0)|A=1]) the following decomposition holds:
E[Y(1)|A=1] − E[Y(0)|A=0] = ATE
+ {E[Y(0)|A=1] − E[Y(0)|A=0]} ← Selection Bias / Baseline Bias
+ (1−π) · (ATT − ATU) ← Heterogeneous Treatment Effect Bias (HTEB)
Interpretation of the terms:
— Selection Bias / Baseline Bias = E[Y(0)|A=1] − E[Y(0)|A=0]: How would the two groups differ if there were no treatment from the start? It is simply a description of baseline differences between the two groups under the control condition.
— HTEB (Heterogeneous Treatment Effect Bias) = (1−π)·(ATT−ATU): The expected difference in treatment effect between those in the treatment and control groups (multiplied by the population share). Arises when ξ≠0 in this tool.
When does Naive = ATE? Exactly when both terms vanish:
— Selection Bias = 0 ⟺ E[Y(0)|A=1] = E[Y(0)|A=0] ⟺ Y(0) ⊥ A (no baseline confounding)
— HTEB = 0 ⟺ ATT = ATU (no heterogeneity problem) or π = 1
In an RCT treatment is randomly assigned → Y(0) ⊥ A (Selection Bias = 0) and ATT ≈ ATU (HTEB ≈ 0) → Naive = ATE ✓
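Because the simulation gives access to both potential outcomes, the decomposition can be verified exactly. A Python sketch with ξ = 0.5 so that HTEB ≠ 0 (values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(13)
n, theta, xi = 500_000, 1.0, 0.5

L = rng.normal(size=n)
A = rng.binomial(1, 1 / (1 + np.exp(-L)))
eps = rng.normal(size=n)
Y0 = L + eps                           # potential outcome without treatment
Y1 = (theta + xi * L) + L + eps        # potential outcome with treatment
Y = np.where(A == 1, Y1, Y0)           # observed outcome

pi = A.mean()
naive = Y[A == 1].mean() - Y[A == 0].mean()
ate = (Y1 - Y0).mean()
selection = Y0[A == 1].mean() - Y0[A == 0].mean()
hteb = (1 - pi) * ((Y1 - Y0)[A == 1].mean() - (Y1 - Y0)[A == 0].mean())

print(f"naive = {naive:.3f}")
print(f"ATE + Selection Bias + HTEB = {ate + selection + hteb:.3f}")  # identical up to float
```

The decomposition is an algebraic identity, so the two numbers agree to floating-point precision, not just approximately.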
Source: Cunningham, S. (2021). Causal Inference: The Mixtape. Yale University Press. Freely available online — with many further techniques for causal analysis: IPW, Difference-in-Differences, Regression Discontinuity, Instrumental Variables and more. An excellent introduction to the full breadth of causal inference methods.