Students with more prior knowledge (high L) attend the tutorial more often (L→A) and achieve better grades anyway (L→Y). This creates a spurious correlation: tutorial participants perform better, not only because of the tutorial but also because they already knew more.
Why does the naive group comparison mislead? Students who attend the tutorial (A=1) already have more prior knowledge on average, so part of their better grades would have come "on their own". The naive mean comparison E[Y|A=1] − E[Y|A=0] mixes the true tutorial effect with the prior-knowledge advantage: Naive estimate = Causal effect + Confounding bias. Only in a randomised experiment (RCT) would both groups be comparable; in observational studies we need an adjustment method such as G-Computation.
G-Computation asks the counterfactual question for each person: "What would the grade be with the same prior knowledge, once with and once without the tutorial?" The averaged difference is the causal effect, adjusted for prior knowledge.
—
DGP of this tool.
All data are simulated according to the following data-generating process:
L ~ Normal(0,1) (confounder, e.g. health status)
A | L ~ Bernoulli(σ(γ_A·L)) (treatment depends on L via γ_A)
Y | A, L ~ Normal((θ + ξ·L)·A + γ_Y·L, 1) (outcome)
θ is the true causal effect of A on Y; with ξ≠0 it is the effect at L=0 and, because E[L]=0, also the ATE. γ_A controls how strongly L drives treatment selection; γ_Y controls how strongly L affects the outcome; ξ controls how strongly the effect of A varies with L. In real data θ, γ_A and γ_Y are unknown; here we can set all three.
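The data-generating process above can be sketched in a few lines of Python. This is an illustrative stand-in, not the tool's own code: the function name `simulate_dgp`, the default parameter values, and the reading of σ as the logistic function are assumptions made for this sketch.

```python
import math
import random

def simulate_dgp(n, theta=1.0, xi=0.0, gamma_a=1.0, gamma_y=1.0, seed=1):
    """Draw n persons (L, A, Y) from the DGP; all defaults are illustrative."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        l = rng.gauss(0.0, 1.0)                          # L ~ Normal(0, 1)
        p_treat = 1.0 / (1.0 + math.exp(-gamma_a * l))   # sigma assumed to be logistic
        a = 1 if rng.random() < p_treat else 0           # A | L ~ Bernoulli
        y = rng.gauss((theta + xi * l) * a + gamma_y * l, 1.0)  # Y | A, L
        rows.append((l, a, y))
    return rows

def mean(values):
    return sum(values) / len(values)

data = simulate_dgp(20000)
# Naive group comparison: mean outcome of the treated minus mean outcome of the untreated
naive = mean([y for _, a, y in data if a == 1]) - mean([y for _, a, y in data if a == 0])
print(f"true effect theta = 1.00, naive difference = {naive:.2f}")
```

With positive γ_A and γ_Y the naive difference lands clearly above θ, which is the confounding bias the rest of this text decomposes.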
G-Computation: fully Bayesian.
The outcome model Y ~ A * L is fit in a Bayesian framework: flat priors, then K samples from the posterior. For each sample we compute for each person "What would Y be under A=1 vs. A=0?" and average the difference across the target population. The result is a proper posterior distribution of the estimand.
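The sampling loop described above can be sketched without an MCMC library under two simplifying assumptions made here for illustration: the residual standard deviation is treated as known (1, as in the DGP), and the coefficients get flat priors. The posterior of the coefficients is then Normal(β̂, (XᵀX)⁻¹), which can be sampled through a Cholesky factor. All function names are hypothetical; this is neither the tool's browser code nor what brms does internally.

```python
import math
import random

def simulate_dgp(n, theta=1.0, xi=0.5, gamma_a=1.0, gamma_y=1.0, seed=1):
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        l = rng.gauss(0.0, 1.0)
        a = 1 if rng.random() < 1.0 / (1.0 + math.exp(-gamma_a * l)) else 0
        rows.append((l, a, rng.gauss((theta + xi * l) * a + gamma_y * l, 1.0)))
    return rows

def design_row(l, a):
    """Columns of Y ~ A * L: intercept, A, L, A:L."""
    return [1.0, float(a), l, a * l]

def cholesky(m):
    """Lower-triangular C with C C^T = m (m symmetric positive definite)."""
    k = len(m)
    c = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1):
            s = m[i][j] - sum(c[i][t] * c[j][t] for t in range(j))
            c[i][j] = math.sqrt(s) if i == j else s / c[j][j]
    return c

def solve_lower(c, b):
    """Solve C w = b by forward substitution."""
    w = []
    for i in range(len(b)):
        w.append((b[i] - sum(c[i][t] * w[t] for t in range(i))) / c[i][i])
    return w

def solve_upper_t(c, b):
    """Solve C^T u = b by back substitution."""
    k = len(b)
    u = [0.0] * k
    for i in reversed(range(k)):
        u[i] = (b[i] - sum(c[t][i] * u[t] for t in range(i + 1, k))) / c[i][i]
    return u

data = simulate_dgp(1000)
X = [design_row(l, a) for l, a, _ in data]
y = [row[2] for row in data]
k = 4
xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
chol = cholesky(xtx)
beta_hat = solve_upper_t(chol, solve_lower(chol, xty))  # least-squares solution

rng = random.Random(2)
ate_draws = []
for _ in range(200):  # K posterior samples
    z = [rng.gauss(0.0, 1.0) for _ in range(k)]
    # beta_hat + C^{-T} z has covariance (X^T X)^{-1}
    beta = [b + u for b, u in zip(beta_hat, solve_upper_t(chol, z))]
    # Per person: prediction under A=1 minus prediction under A=0
    contrasts = [
        sum(b * x for b, x in zip(beta, design_row(l, 1)))
        - sum(b * x for b, x in zip(beta, design_row(l, 0)))
        for l, _, _ in data
    ]
    ate_draws.append(sum(contrasts) / len(contrasts))

post_mean = sum(ate_draws) / len(ate_draws)
post_sd = math.sqrt(sum((d - post_mean) ** 2 for d in ate_draws) / (len(ate_draws) - 1))
print(f"posterior ATE: mean = {post_mean:.2f}, sd = {post_sd:.2f}")
```

Each entry of `ate_draws` is one posterior draw of the ATE; averaging the contrasts over a subset of persons instead of everyone would give draws of a different estimand.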
Why different populations for ATE, ATT, ATU?
The outcome model applies to everyone, but the question of for whom we average defines the estimand:
- ATE: Ŷ(1)−Ŷ(0) averaged over all n persons. "What would the effect be in the total population?" In code: avg_comparisons(fit_gc, variables = "A")
- ATT: only the n₁ actually treated (A=1). "Did the treatment help those who received it?" Their L-values are systematically higher (since treatment correlates with L). In code: newdata = subset(dat, A==1)
- ATU: only the n₀ untreated (A=0). "What would happen if we extended treatment to them?" In code: newdata = subset(dat, A==0)
Without effect heterogeneity (ξ=0) all three values are equal. With ξ≠0 they diverge, because the subpopulations have different L-distributions and the effect depends on L.
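The three estimands differ only in whose predicted contrasts are averaged. The sketch below illustrates this with a plug-in (least-squares) fit rather than a Bayesian posterior, a deliberate simplification. It uses the fact that the least-squares fit of Y ~ A * L has a separate intercept and slope per arm, so one line is fitted per arm. Function names and parameter values are hypothetical, and σ is assumed to be the logistic function.

```python
import math
import random

def simulate_dgp(n, theta=1.0, xi=0.5, gamma_a=1.0, gamma_y=1.0, seed=1):
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        l = rng.gauss(0.0, 1.0)
        a = 1 if rng.random() < 1.0 / (1.0 + math.exp(-gamma_a * l)) else 0
        rows.append((l, a, rng.gauss((theta + xi * l) * a + gamma_y * l, 1.0)))
    return rows

def fit_line(points):
    """Least-squares intercept and slope for a list of (l, y) pairs."""
    n = len(points)
    mean_l = sum(l for l, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((l - mean_l) ** 2 for l, _ in points)
    sxy = sum((l - mean_l) * (y - mean_y) for l, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_l, slope

data = simulate_dgp(20000)
b0, s0 = fit_line([(l, y) for l, a, y in data if a == 0])  # outcome line under A=0
b1, s1 = fit_line([(l, y) for l, a, y in data if a == 1])  # outcome line under A=1

def contrast(l):
    """Predicted Y under A=1 minus predicted Y under A=0 for a person with this L."""
    return (b1 + s1 * l) - (b0 + s0 * l)

def average_effect(rows):
    return sum(contrast(l) for l, _, _ in rows) / len(rows)

ate = average_effect(data)                            # everyone
att = average_effect([r for r in data if r[1] == 1])  # only the treated
atu = average_effect([r for r in data if r[1] == 0])  # only the untreated
print(f"ATE = {ate:.2f}, ATT = {att:.2f}, ATU = {atu:.2f}")
```

The outcome model is fitted once; only the set of persons passed to `average_effect` changes, which mirrors the role of `newdata` in the R calls above.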
Browser approximation.
For illustration this tool uses Normal-Normal conjugacy: with a Normal likelihood and Normal prior the posterior is analytically tractable, no MCMC needed. The R code shows brms with full MCMC, which is the recommended approach for your own analyses.
Effect heterogeneity (ξ).
Without heterogeneity (ξ=0): Y = θ·A + γ_Y·L + ε. The causal effect θ is the same for everyone, regardless of L. ATE = ATT = ATU = θ, and in the effect plot (4th panel) all points lie on a horizontal line.
With ξ=0.5: Y = (θ + 0.5·L)·A + γ_Y·L + ε. Persons with higher L benefit more, so a positive slope appears in the effect plot. With positive confounding the treated have higher L-values on average (persons with high L are treated more often), and therefore ATT > ATE > ATU. Same data, same method, three different correct answers, because three different population questions are asked.
Overlap (positivity assumption).
Overlap means: for every observed L-value there are both treated and untreated persons, i.e. 0 < P(A=1|L=l) < 1. This is visible in the first scatter plot: with good overlap, orange (A=1) and blue (A=0) points mix across the entire L-range. In the "No overlap" scenario the groups are almost completely separated: persons with high L are almost always treated.
This is a problem because G-Computation then extrapolates the outcome model into L-regions where no (or very few) controls were observed. The estimate becomes highly model-dependent, visible as a wider posterior and greater uncertainty.
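A simple numerical counterpart to the scatter plot is the share of treated persons within bins of L. The sketch below is only an illustration: the bin edges, the helper name `treated_share_by_bin`, and the two γ_A values (0.5 for good overlap, 6.0 for poor overlap) are hypothetical choices and not the settings of the tool's scenarios. σ is again assumed to be the logistic function.

```python
import math
import random

def treated_share_by_bin(gamma_a, n=20000, seed=1):
    """Share of treated persons in five L-bins for a given selection strength."""
    rng = random.Random(seed)
    edges = [-1.5, -0.5, 0.5, 1.5]        # bin boundaries on L
    counts = [[0, 0] for _ in range(5)]   # [n_total, n_treated] per bin
    for _ in range(n):
        l = rng.gauss(0.0, 1.0)
        a = rng.random() < 1.0 / (1.0 + math.exp(-gamma_a * l))
        b = sum(l > e for e in edges)     # index of the bin containing l
        counts[b][0] += 1
        counts[b][1] += a                 # True counts as 1
    return [treated / total for total, treated in counts]

good = treated_share_by_bin(gamma_a=0.5)
poor = treated_share_by_bin(gamma_a=6.0)
print("gamma_A = 0.5:", [round(s, 2) for s in good])
print("gamma_A = 6.0:", [round(s, 2) for s in poor])
```

Shares close to 0 or 1 in the outer bins indicate L-regions where one group is practically absent, so any counterfactual prediction there rests on the functional form of the outcome model.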
When is G-Computation essential, and when does the regression coefficient suffice?
In the linear model without interaction (Y ~ A + L) the G-Computation ATE equals exactly the regression coefficient β̂_A: linearity ensures that conditional and marginal effects coincide (and when no confounder has to be adjusted for, so that the model is just Y ~ A, G-Computation reduces to the difference in group means that a t-test compares).
In the linear model with interaction (Y ~ A + L + A:L) this no longer holds in general: β̂_A is only the effect at L=0, while G-Computation correctly averages over the actual L-distribution of the target population (ATE = β̂_A + β̂_{AL}·Ē[L], where Ē[L] is the mean of L in that population). Moreover, G-Computation computes ATT and ATU by restricting to the respective subpopulation, which is not possible with the coefficient alone.
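The identity in parentheses follows in one step from the interaction model. In the short derivation below, β̂_0 and β̂_L denote the intercept and the main effect of L (names assumed here), and the bars denote sample means of L over all persons, over the treated, and over the untreated. It applies to a point estimate or, draw by draw, to each posterior sample.

```latex
\begin{aligned}
\hat{Y}_i(a) &= \hat\beta_0 + \hat\beta_A\, a + \hat\beta_L L_i + \hat\beta_{AL}\, a L_i \\
\hat{Y}_i(1) - \hat{Y}_i(0) &= \hat\beta_A + \hat\beta_{AL} L_i \\
\widehat{\mathrm{ATE}} &= \frac{1}{n}\sum_{i=1}^{n}\bigl(\hat\beta_A + \hat\beta_{AL} L_i\bigr)
  = \hat\beta_A + \hat\beta_{AL}\,\bar{L} \\
\widehat{\mathrm{ATT}} &= \hat\beta_A + \hat\beta_{AL}\,\bar{L}_{A=1}, \qquad
\widehat{\mathrm{ATU}} = \hat\beta_A + \hat\beta_{AL}\,\bar{L}_{A=0}
\end{aligned}
```

The three estimands therefore differ only through the mean of L in the population being averaged over, and they coincide when β̂_{AL} = 0.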
Formula choice: Y ~ A + L versus Y ~ A * L
Standard DAGs encode causal structure (which variable influences which), but not the functional form of those relationships. Whether the effect of A on Y differs across levels of L (effect heterogeneity, A:L interaction) is a substantive claim that goes beyond the graph: "Is it plausible that L moderates the effect of A?"
The tool therefore always specifies the outcome model as Y ~ A * L, a conservative default that allows heterogeneity without assuming it. Whether A:L is actually needed can be tested empirically with loo_compare(loo(fit_gc), loo(fit_add)) (see R code). If evidence for the interaction is lacking, Y ~ A + L is the more parsimonious model. The Golem Builder helps decide whether a moderation relationship is theoretically justified in the DAG.
G-Computation is essential for GLMs with non-linear link functions (e.g. logit for logistic regression, log for Poisson): there the coefficient is a conditional effect on the link scale (e.g. log-odds ratio). Due to the non-collapsibility of the logit link, conditional and marginal effects differ fundamentally, even without confounding. G-Computation delivers the marginal effect on the response scale (e.g. risk difference in probability points), which the coefficient alone cannot provide. avg_comparisons() in marginaleffects does exactly that, model-class-agnostically.
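Non-collapsibility can be checked with exact arithmetic, no simulation needed. The sketch below assumes a hypothetical logistic outcome model with made-up coefficients and a binary L that is independent of A (so there is no confounding at all). It compares the conditional odds ratio with the marginal one obtained by averaging predicted risks over L.

```python
import math

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def odds(p):
    return p / (1.0 - p)

# Hypothetical model: logit P(Y=1 | A, L) = 1.0*A + 2.0*L,
# with L = -1 or +1 (probability 0.5 each) and L independent of A.
beta_a, beta_l = 1.0, 2.0
l_values = [-1.0, 1.0]

conditional_or = math.exp(beta_a)  # the same odds ratio within every L stratum

# G-Computation: predict the risk under A=1 and under A=0, then average over L
risk_1 = sum(expit(beta_a + beta_l * l) for l in l_values) / len(l_values)
risk_0 = sum(expit(beta_l * l) for l in l_values) / len(l_values)

marginal_or = odds(risk_1) / odds(risk_0)
risk_difference = risk_1 - risk_0
print(f"conditional OR = {conditional_or:.2f}, marginal OR = {marginal_or:.2f}")
print(f"marginal risk difference = {risk_difference:.3f}")
```

The marginal odds ratio is smaller than the conditional one although L and A are independent, which is why the coefficient cannot stand in for the marginal effect.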
Extensive examples of G-Computation with brms for various GLMs (logistic, Poisson, multinomial, etc.) can be found on the blog of Solomon Kurz: Boost your power with baseline covariates.
OVB formula: why the naive estimator is biased.
The naive regression coefficient β̂_naive from Y ~ A equals (in the population, for ξ=0):
β̂_naive = θ + γ_Y · Cov(L, A) / Var(A)
= θ (true effect) + γ_Y · δ_{A∼L} (confounding bias)
δ_{A∼L} = Cov(L,A)/Var(A) is the slope coefficient from the auxiliary regression of L on A (L ~ A); for binary A it equals E[L|A=1] − E[L|A=0] and measures how strongly the confounder differs between treated and untreated. The product γ_Y · δ_{A∼L} is the Omitted Variable Bias: it vanishes exactly when either L does not affect the outcome (γ_Y=0) or L is uncorrelated with A (δ_{A∼L}=0, which in this tool happens when γ_A=0).
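The OVB formula can be checked on simulated data with a homogeneous effect (ξ=0). The sketch below is an illustration under assumed parameter values; the helper names are hypothetical and σ is taken to be the logistic function. Both slopes are computed as covariance over variance.

```python
import math
import random

def simulate_dgp(n, theta, gamma_a, gamma_y, seed=1):
    """Draw (L, A, Y) with a homogeneous effect (xi = 0)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        l = rng.gauss(0.0, 1.0)
        a = 1 if rng.random() < 1.0 / (1.0 + math.exp(-gamma_a * l)) else 0
        rows.append((l, a, rng.gauss(theta * a + gamma_y * l, 1.0)))
    return rows

def cov(xs, ys):
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / len(xs)

theta, gamma_y = 1.0, 0.8
data = simulate_dgp(20000, theta=theta, gamma_a=1.0, gamma_y=gamma_y)
L = [row[0] for row in data]
A = [row[1] for row in data]
Y = [row[2] for row in data]

beta_naive = cov(Y, A) / cov(A, A)   # slope of the naive regression Y ~ A
delta = cov(L, A) / cov(A, A)        # slope of the auxiliary regression L ~ A
predicted = theta + gamma_y * delta  # OVB formula

print(f"naive slope = {beta_naive:.3f}, theta + gamma_Y * delta = {predicted:.3f}")
```

The two numbers agree up to sampling noise, and both exceed θ by roughly γ_Y · δ.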
Sign of the bias. In this tool γ_A (L→A) and γ_Y (L→Y) can be set independently, yielding the following sign structure:
| L→A | L→Y | Bias = (L→Y)×(L→A) | Naive vs. True |
|---|---|---|---|
| + | + | + (overestimation) | Naive larger |
| − | − | + (−×− = +) | Naive larger (often surprising!) |
| + | − | − (underestimation) | Naive smaller |
| − | + | − (underestimation) | Naive smaller (wrong sign possible!) |
Rule of thumb: Bias > 0 when both paths have the same sign (both + or both −). Bias < 0 when signs differ.
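The sign table can be reproduced from the population formula Bias = γ_Y · Cov(L,A)/Var(A) by numerical integration over the Normal(0,1) distribution of L. The sketch below assumes a homogeneous effect (ξ=0) and a logistic σ; the function name `population_bias` and the integration grid are hypothetical choices.

```python
import math

def population_bias(gamma_a, gamma_y, steps=4000):
    """gamma_Y * Cov(L, A) / Var(A) for L ~ Normal(0,1), A | L ~ Bernoulli(expit(gamma_A * L))."""
    lo, hi = -8.0, 8.0
    h = (hi - lo) / steps
    p_treated = 0.0  # E[A]
    e_la = 0.0       # E[L * A], which equals Cov(L, A) because E[L] = 0
    for i in range(steps + 1):
        l = lo + i * h
        weight = h * (0.5 if i in (0, steps) else 1.0)  # trapezoid rule
        density = math.exp(-0.5 * l * l) / math.sqrt(2.0 * math.pi)
        p_a = 1.0 / (1.0 + math.exp(-gamma_a * l))
        p_treated += weight * density * p_a
        e_la += weight * density * l * p_a
    var_a = p_treated * (1.0 - p_treated)
    return gamma_y * e_la / var_a

for gamma_a, gamma_y in [(1, 1), (-1, -1), (1, -1), (-1, 1)]:
    bias = population_bias(gamma_a, gamma_y)
    print(f"gamma_A = {gamma_a:+d}, gamma_Y = {gamma_y:+d}: bias = {bias:+.3f}")
```

Equal signs give a positive bias and opposite signs a negative one, with the same magnitude in all four rows when |γ_A| and |γ_Y| are held fixed.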
Heckman decomposition: Naive comparison = ATE + Selection Bias + HTEB.
The naive comparison E[Y(1)|A=1] − E[Y(0)|A=0] generally does not equal the ATE (Cunningham, 2021; Morgan & Winship, 2007). The ATE is first a weighted sum of ATT and ATU (π = P(A=1)):
ATE = π · ATT + (1−π) · ATU
After algebraic rearrangement (add and subtract E[Y(0)|A=1]) the following holds directly:
E[Y(1)|A=1] − E[Y(0)|A=0] = ATE
+ {E[Y(0)|A=1] − E[Y(0)|A=0]} ← Selection Bias / Baseline Bias
+ (1−π) · (ATT − ATU) ← Heterogeneous Treatment Effect Bias (HTEB)
Interpretation of the terms:
— Selection Bias / Baseline Bias = E[Y(0)|A=1] − E[Y(0)|A=0]: How would the two groups differ if there were no treatment from the start? It is simply a description of baseline differences between the two groups under the control condition.
— HTEB (Heterogeneous Treatment Effect Bias) = (1−π)·(ATT−ATU): The difference in average treatment effect between the treated and the untreated, weighted by the share of untreated (1−π). Arises when ξ≠0 in this tool.
When does Naive = ATE? Exactly when both terms vanish:
— Selection Bias = 0 ⟺ E[Y(0)|A=1] = E[Y(0)|A=0] ⟺ Y(0) ⊥ A (no baseline confounding)
— HTEB = 0 ⟺ ATT = ATU (no heterogeneity problem) or π = 1
In an RCT treatment is randomly assigned → Y(0) ⊥ A (Selection Bias = 0) and ATT ≈ ATU (HTEB ≈ 0) → Naive = ATE ✓
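Because a simulation knows both potential outcomes of every person, the decomposition can be verified term by term. The sketch below is an illustration with assumed parameter values. It additionally assumes that Y(1) and Y(0) share the same noise term, so that Y(1) = Y(0) + θ + ξ·L, and that σ is the logistic function.

```python
import math
import random

def mean(values):
    return sum(values) / len(values)

rng = random.Random(1)
theta, xi, gamma_a, gamma_y = 1.0, 0.5, 1.0, 1.0  # illustrative values

people = []  # (A, Y(0), Y(1)) per person
for _ in range(20000):
    l = rng.gauss(0.0, 1.0)
    a = 1 if rng.random() < 1.0 / (1.0 + math.exp(-gamma_a * l)) else 0
    y0 = rng.gauss(gamma_y * l, 1.0)  # potential outcome without treatment
    y1 = y0 + theta + xi * l          # potential outcome with treatment
    people.append((a, y0, y1))

treated = [p for p in people if p[0] == 1]
untreated = [p for p in people if p[0] == 0]
pi = len(treated) / len(people)

ate = mean([y1 - y0 for _, y0, y1 in people])
att = mean([y1 - y0 for _, y0, y1 in treated])
atu = mean([y1 - y0 for _, y0, y1 in untreated])

# What an analyst observes: Y(1) for the treated, Y(0) for the untreated
naive = mean([y1 for _, _, y1 in treated]) - mean([y0 for _, y0, _ in untreated])
selection_bias = mean([y0 for _, y0, _ in treated]) - mean([y0 for _, y0, _ in untreated])
hteb = (1.0 - pi) * (att - atu)

print(f"naive = {naive:.3f}")
print(f"ATE + selection bias + HTEB = {ate + selection_bias + hteb:.3f}")
```

The identity holds exactly in the sample, not just in expectation, because it is pure algebra on group means. Setting γ_A to 0 in the sketch mimics randomisation and shrinks both bias terms towards zero.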
Source: Cunningham, S. (2021). Causal Inference: The Mixtape. Yale University Press. Freely available online, with many further techniques for causal analysis: IPW, Difference-in-Differences, Regression Discontinuity, Instrumental Variables and more. An excellent introduction to the full breadth of causal inference methods.