LOO Lab
Bayesian Model Comparison via Leave-One-Out Cross-Validation
© Dr. Rainer Düsing · Interactive Tools by Claude
LOO-CV asks: How well does my model predict a data point it has never seen? That is the most honest question you can ask a model — and it exposes overfitting that stays invisible in-sample.
Overview
[Interactive animation. Legend: full-data fit · Model A LOO fit · Model B LOO fit · data point · LOO prediction ŷ]
Model A: Linear (degree 1) · elpd_loo (A)
Model B: Polynomial (degree 4) · elpd_loo (B)
Why fit a model if you only evaluate it on training data? LOO asks more honestly: how well does the model generalise to points it never saw? Click Next → or start the animation to step through the data point by point.
Important: In brms, LOO is written directly into the model object — model.1 <- add_criterion(model.1, c("loo")) — and is then available for comparison: loo_compare(model.1, model.2, model.3). Paste that output below.
Input
Threshold Explorer — When is a difference reliable?
[Sliders: elpd_diff = −5.0 · SE = 2.5]
The three decision scenarios
The difference exceeds its own uncertainty. This is a reliable finding.

Example: elpd_diff = −8.7, SE = 3.4 → |elpd_diff|/SE = 2.56 → Model A is clearly better.

Recommendation: Use the better model. Report elpd_diff and SE in your paper. Ask substantively why the model fits better — that is the real question.
The models cannot be clearly separated. This is not a failure: it means both explain the data similarly well, and LOO's own estimation uncertainty does not permit a clear decision between them.

Recommendation: Choose by parsimony (simpler model) or substantive grounds. Report the similarity transparently. Consider model stacking via loo_model_weights().
High Pareto-k values indicate influential observations — the PSIS approximation is unreliable there. The LOO result itself should be interpreted with caution.

Recommendation: Refit with reloo() for the problematic observations (exact LOO, at the cost of one additional model fit per flagged observation). Check whether single outliers dominate the result. Report the proportion of problematic k values.
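The ratio logic behind the first two scenarios can be checked in a few lines of base R. The values below are the illustrative numbers from the scenario above, not real loo_compare output:

```r
# Classify a loo_compare result by the 2*SE rule of thumb (illustrative values)
elpd_diff <- -8.7
se_diff   <- 3.4

ratio <- abs(elpd_diff) / se_diff   # 8.7 / 3.4 = 2.56

verdict <- if (ratio > 2) {
  "reliable difference"             # scenario 1: difference exceeds its uncertainty
} else {
  "practically indistinguishable"   # scenario 2: decide by parsimony or stacking
}
```

With your own models, substitute the elpd_diff and se_diff columns from loo_compare().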
WAIC vs. LOO

WAIC and LOO-CV estimate the same target: out-of-sample predictive accuracy. LOO via PSIS is more robust and is today's standard (Vehtari et al. 2017). The numerical difference between them is usually small in practice — as long as there are no extreme k values. Use LOO, and check the Pareto-k diagnostics as a quality control.

brms code for the LOO workflow
# 1. Fit models (save_pars(all = TRUE) is needed later for reloo()/moment matching)
model.1 <- brm(y ~ x, data = d, save_pars = save_pars(all = TRUE))
model.2 <- brm(y ~ x + z, data = d, save_pars = save_pars(all = TRUE))
model.3 <- brm(y ~ x + z + w, data = d, save_pars = save_pars(all = TRUE))

# 2. Write LOO directly into the model object (no separate loo object needed)
model.1 <- add_criterion(model.1, c("loo"))
model.2 <- add_criterion(model.2, c("loo"))
model.3 <- add_criterion(model.3, c("loo"))

# 3. Compare models — LOO is read directly from the objects
loo_compare(model.1, model.2, model.3) # → elpd_diff, se_diff ← paste this output into Stage 2

# 4. Check Pareto-k diagnostics for a single model
print(model.1$criteria$loo) # ← paste this output optionally into Stage 2

# 5. Model Stacking — ensemble weights instead of choosing one model (optional)
# Asks: which combination of models predicts new data best?
# Weights sum to 1; model with w ≈ 0 contributes nothing.
loo_model_weights(model.1, model.2, model.3) # → e.g. model.1=0.12, model.2=0.73, model.3=0.15

# 6. For high k values: reloo() directly into the object
model.1 <- add_criterion(model.1, c("loo"), reloo = TRUE, k_threshold = 0.7)
Next step: A model has been selected — now it is time to interpret the posterior substantively. → Decision Lab
← Prior Predictive Check Decision Lab →
LOO Lab — Help
What does this tool do?
LOO-CV (Leave-One-Out Cross-Validation) measures the out-of-sample predictive accuracy of a Bayesian model. The tool has three stages: concept → analyse R output → make a decision.

Note: This tool does not compute LOO — that is not feasible in a browser. You compute LOO in R with add_criterion() and paste the output into Stage 2.
Stage 1 — The concept
Model A (linear) vs. Model B (polynomial degree 4) on the same dataset. Each step: one point is held out, both models are refitted on N−1 points, then predictions are compared. Dashed line = LOO fit; ◆ = LOO prediction; arrow = residual.
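The hold-out-and-refit loop from Stage 1 can be sketched in base R. This is a conceptual illustration with lm() and simulated data, not the PSIS-LOO approximation that brms actually computes; the two formulas mirror Model A and Model B above:

```r
# Simulated data from a linear truth (assumed, for illustration only)
set.seed(1)
n <- 30
x <- seq(-2, 2, length.out = n)
y <- 1 + 0.8 * x + rnorm(n, sd = 0.5)
d <- data.frame(x = x, y = y)

# Exact LOO for one lm() formula: refit on N-1 points, predict the held-out point
loo_sse <- function(formula, d) {
  errs <- vapply(seq_len(nrow(d)), function(i) {
    fit  <- lm(formula, data = d[-i, ])     # fit without point i
    pred <- predict(fit, newdata = d[i, ])  # LOO prediction for point i
    d$y[i] - pred                           # LOO residual (the arrow in the plot)
  }, numeric(1))
  sum(errs^2)                               # summed squared LOO error
}

sse_A <- loo_sse(y ~ x,          d)  # Model A: linear
sse_B <- loo_sse(y ~ poly(x, 4), d)  # Model B: degree-4 polynomial
# Because the truth here is linear, the flexible model tends to show
# larger held-out error: overfitting that in-sample fit would hide.
```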
Stage 2 — R output
In R, LOO is written directly into the model object:
model.1 <- add_criterion(model.1, c("loo"))
Then: loo_compare(model.1, model.2, ...) → paste output into the first field.
The second field (optional) takes the output of print(model.1$criteria$loo) for the Pareto-k diagnostics of a single model. Load example shows the expected format.
Model Stacking
Model Stacking is a form of model averaging — but without model priors. Instead of choosing one model, an ensemble is formed:

Ŷ = w₁·Ŷ₁ + w₂·Ŷ₂ + w₃·Ŷ₃ (weights sum to 1)

Weights are optimised directly from LOO predictive performance (Yao et al. 2018). Redundant models automatically receive weight ≈ 0.

vs. classical model averaging (BMA):
BMA weights models by posterior model probability; stacking weights them by out-of-sample predictive accuracy. Stacking copes better with sets of similar models and avoids BMA's sensitivity to model priors (McElreath, Ch. 7).
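The ensemble formula can be written out directly. The weights below are the illustrative loo_model_weights() output from the code section, and the point predictions are made up for this example:

```r
# Assumed stacking weights (must sum to 1) and per-model predictions
# for a single new data point
w    <- c(0.12, 0.73, 0.15)   # illustrative weights for model.1..model.3
yhat <- c(2.1, 2.4, 2.3)      # hypothetical point predictions from the 3 models

stopifnot(abs(sum(w) - 1) < 1e-9)  # sanity check: weights sum to 1

y_ens <- sum(w * yhat)  # Y = w1*Y1 + w2*Y2 + w3*Y3
y_ens                   # 0.12*2.1 + 0.73*2.4 + 0.15*2.3 = 2.349
```

A model with weight ≈ 0 (here none, but often the case for redundant models) simply drops out of the sum.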
Decision rule
|elpd_diff| > 2·SE → clear difference
|elpd_diff| < 2·SE → practically indistinguishable
k > 0.7 → LOO estimate unreliable, reloo() recommended
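Reporting the proportion of problematic k values (scenario 3's recommendation) takes one line once the diagnostics are extracted. The k vector below is invented for illustration; with a fitted brms model you would pull the values from the stored loo object (typically model.1$criteria$loo$diagnostics$pareto_k):

```r
# Hypothetical Pareto-k values for 10 observations
k <- c(0.12, 0.45, 0.31, 0.82, 0.05, 0.67, 0.91, 0.28, 0.52, 0.10)

n_bad    <- sum(k > 0.7)    # 2 observations exceed the 0.7 threshold
prop_bad <- mean(k > 0.7)   # 0.2 -> the proportion to report
which(k > 0.7)              # indices 4 and 7: candidates for reloo()
```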
Prerequisites
brms Model Builder (model structure) · Posterior PPC (model diagnostics)