LOO-CV asks: How well does my model predict a data point it has never seen? That is the most honest question you can ask a model, and it exposes overfitting that stays invisible in-sample.
Overview
Figure legend: full-data fit; Model A LOO fit; Model B LOO fit; data point; LOO prediction ŷ
Model A: Linear (degree 1)
elpd_loo (A): n/a
Model B: Polynomial (degree 4)
elpd_loo (B): n/a
Why fit a model if you only evaluate it on training data? LOO asks more honestly: how well does the model generalise to points it never saw? Click Next → or start the animation to step through the data point by point.
Important: In brms, LOO can be stored directly in the model object with model.1 <- add_criterion(model.1, c("loo")), and is then available for comparison: loo_compare(model.1, model.2, model.3). Paste that output below.
Input
Threshold Explorer: When is a difference reliable?
The three decision scenarios
The difference is more than twice its standard error. This is a reliable finding.

Example: elpd_diff = −8.7, SE = 3.4 → |elpd_diff| / SE = 2.56 > 2 → Model A is clearly better.

Recommendation: Use the better model. Report elpd_diff and SE in your paper. Ask substantively why the model fits better: that is the real question.
The models cannot be clearly separated. This is not a failure: it means both explain the data similarly well. The uncertainty of the LOO estimate itself does not permit a clear decision.

Recommendation: Choose by parsimony (simpler model) or substantive grounds. Report the similarity transparently. Consider model stacking via loo_model_weights().
High Pareto-k values indicate influential observations for which the PSIS approximation is unreliable. The LOO result itself should be interpreted with caution.

Recommendation: Refit with reloo() for the problematic observations (exact LOO for those points, costing one additional fit per problematic observation). Check whether single outliers dominate the model. Report the proportion of problematic k values.
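Reporting the proportion of problematic k values takes only a few lines. The sketch below uses a hypothetical vector of Pareto-k values (the numbers are invented for illustration); with a real fit, the values could presumably be extracted with loo::pareto_k_values() from the stored LOO object.

```r
# Hypothetical Pareto-k values, one per observation (illustrative numbers only).
# With a real fit they might come from: loo::pareto_k_values(model.1$criteria$loo)
k <- c(0.12, 0.31, 0.05, 0.74, 0.48, 0.91, 0.22, 0.66, 0.10, 0.35)

threshold <- 0.7                  # same threshold as in the decision rule
bad_ids   <- which(k > threshold) # observations that would need an exact refit
prop_bad  <- mean(k > threshold)  # proportion of problematic k values

cat("Problematic observations:", bad_ids, "\n")
cat("Proportion with k >", threshold, ":", prop_bad, "\n")
```

The length of bad_ids is also the number of extra model fits that reloo would need.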
WAIC vs. LOO

WAIC and LOO-CV estimate the same target: out-of-sample predictive accuracy. LOO via PSIS is more robust and is today's standard (Vehtari et al. 2017). The numerical difference between them is usually small in practice, as long as there are no extreme k values. Use LOO, and check the Pareto-k diagnostics as a quality control.
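To see why the two estimates usually agree, both can be computed by hand from a matrix of pointwise log-likelihoods. The sketch below is a toy setup, not the tool's data: a normal model with known sigma = 1 and a flat prior on the mean, so posterior draws can be simulated directly. The LOO part uses raw importance sampling without Pareto smoothing, so it is a simplified stand-in for PSIS-LOO, not the algorithm the loo package runs.

```r
set.seed(42)
n <- 30     # observations (toy data, assumed)
S <- 4000   # posterior draws

y  <- rnorm(n, mean = 0.5, sd = 1)
# Posterior of mu under a flat prior with known sigma = 1: Normal(mean(y), 1/sqrt(n))
mu <- rnorm(S, mean = mean(y), sd = 1 / sqrt(n))

# S x n matrix: log-likelihood of observation i under posterior draw s
ll <- sapply(y, function(yi) dnorm(yi, mean = mu, sd = 1, log = TRUE))

# WAIC: log pointwise predictive density minus a variance-based penalty
lppd      <- log(colMeans(exp(ll)))
p_waic    <- apply(ll, 2, var)
elpd_waic <- sum(lppd - p_waic)

# Raw importance-sampling LOO (weights 1 / likelihood), no Pareto smoothing
elpd_is_loo <- sum(-log(colMeans(exp(-ll))))

round(c(elpd_waic = elpd_waic, elpd_is_loo = elpd_is_loo), 2)
```

In a well-behaved model like this one the two numbers should be close. They can drift apart when single observations are highly influential, which is the situation the Pareto-k diagnostics flag.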

brms code for the LOO workflow
library(brms)

# 1. Fit models (save_pars(all = TRUE) is not needed for plain PSIS-LOO;
#    it is only required for moment matching or bridge sampling)
model.1 <- brm(y ~ x, data = d, save_pars = save_pars(all = TRUE))
model.2 <- brm(y ~ x + z, data = d, save_pars = save_pars(all = TRUE))
model.3 <- brm(y ~ x + z + w, data = d, save_pars = save_pars(all = TRUE))

# 2. Write LOO directly into the model object (no separate loo object needed)
model.1 <- add_criterion(model.1, c("loo"))
model.2 <- add_criterion(model.2, c("loo"))
model.3 <- add_criterion(model.3, c("loo"))

# 3. Compare models: LOO is read directly from the objects
loo_compare(model.1, model.2, model.3) # → elpd_diff, se_diff ← paste this output into Stage 2

# 4. Check Pareto-k diagnostics for a single model
print(model.1$criteria$loo) # ← paste this output optionally into Stage 2

# 5. Model stacking: ensemble weights instead of choosing one model (optional)
# Asks: which combination of models predicts new data best?
# Weights sum to 1; a model with w ≈ 0 contributes nothing.
loo_model_weights(model.1, model.2, model.3) # → e.g. model.1=0.12, model.2=0.73, model.3=0.15

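# 6. For high k values: exact refits via reloo, stored in the object
#    (overwrite = TRUE replaces the LOO result already stored in step 2)
model.1 <- add_criterion(model.1, c("loo"), reloo = TRUE, k_threshold = 0.7, overwrite = TRUE)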
Next step: A model has been selected. Now it is time to interpret the posterior substantively. → Decision Lab
LOO Lab: Help
What does this tool do?
LOO-CV (Leave-One-Out Cross-Validation) measures the out-of-sample predictive accuracy of a Bayesian model. The tool has three stages: concept → analyse R output → make a decision.

Note: This tool does not compute LOO, because that is not feasible in a browser. You compute LOO in R with add_criterion() and paste the output into Stage 2.
Stage 1: The concept
Model A (linear) vs. Model B (polynomial degree 4) on the same dataset. Each step: one point is held out, both models are refitted on the remaining N−1 points, then their predictions for the held-out point are compared. Dashed line = LOO fit; ◆ = LOO prediction; arrow = residual.
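The hold-out loop described above can be sketched in base R. This is a simplified, non-Bayesian illustration: it uses lm() instead of brms, squared prediction error instead of elpd, and a simulated toy dataset (all assumptions for the sake of a quick, runnable example, not the tool's own data or method).

```r
# Hypothetical toy data: a linear trend plus noise
set.seed(1)
n <- 20
x <- seq(-2, 2, length.out = n)
d <- data.frame(x = x, y = 1 + 0.8 * x + rnorm(n, sd = 0.5))

# Hold out point i, refit on the other n - 1 points, predict the held-out point
loo_sq_error <- function(formula, data) {
  sapply(seq_len(nrow(data)), function(i) {
    fit  <- lm(formula, data = data[-i, ])
    pred <- predict(fit, newdata = data[i, , drop = FALSE])
    (data$y[i] - pred)^2
  })
}

err_a <- loo_sq_error(y ~ x, d)           # Model A: linear
err_b <- loo_sq_error(y ~ poly(x, 4), d)  # Model B: polynomial, degree 4

# In-sample error for comparison: the flexible model never looks worse here
in_a <- mean(resid(lm(y ~ x, data = d))^2)
in_b <- mean(resid(lm(y ~ poly(x, 4), data = d))^2)

round(c(in_sample_A = in_a, in_sample_B = in_b,
        loo_A = mean(err_a), loo_B = mean(err_b)), 3)
```

In-sample, the degree-4 polynomial can only match or beat the linear model. The LOO errors show whether that extra flexibility still pays off on points the model has not seen.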
Stage 2: R output
In R, LOO is written directly into the model object:
model.1 <- add_criterion(model.1, c("loo"))
Then: loo_compare(model.1, model.2, ...) → paste output into the first field.
The second field (optional) takes the output of print(model.1$criteria$loo) for the Pareto-k diagnostics of a single model. "Load example" shows the expected format.
Model Stacking
Model Stacking is a form of model averaging, but without model priors. Instead of choosing one model, an ensemble is formed:

Ŷ = w₁·Ŷ₁ + w₂·Ŷ₂ + w₃·Ŷ₃ (weights sum to 1)

Weights are optimised directly from LOO predictive performance (Yao et al. 2018). Redundant models automatically receive weight ≈ 0.
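The idea behind the optimisation can be sketched in base R: choose the weights that maximise the summed log of the weighted pointwise LOO predictive densities. The matrix below is hypothetical (invented numbers), and the softmax parametrisation with optim() is a convenience chosen for this sketch, not the optimiser that loo_model_weights() uses internally.

```r
# Hypothetical pointwise LOO log predictive densities
# (rows = observations, columns = models; illustrative numbers only)
lpd <- cbind(
  m1 = c(-1.0, -1.1, -0.9, -3.0, -3.2, -2.9),
  m2 = c(-2.5, -2.4, -2.6, -1.0, -1.1, -0.9),
  m3 = c(-1.3, -1.4, -1.2, -3.3, -3.5, -3.2)  # m1 shifted down by 0.3: redundant
)

stacking_weights <- function(lpd) {
  dens <- exp(lpd)
  # Softmax keeps the weights non-negative and summing to 1
  to_weights <- function(a) {
    a <- c(0, a)
    w <- exp(a - max(a))
    w / sum(w)
  }
  # Negative log score of the weighted mixture of predictive densities
  neg_log_score <- function(a) -sum(log(dens %*% to_weights(a)))
  fit <- optim(rep(0, ncol(lpd) - 1), neg_log_score, method = "BFGS")
  setNames(to_weights(fit$par), colnames(lpd))
}

w <- stacking_weights(lpd)
round(w, 3)
```

Here m1 and m2 each predict a different half of the data well, so both keep substantial weight, while m3 adds nothing beyond m1 and should be pushed towards zero.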

vs. classical model averaging (BMA):
BMA weights by model posterior probability, stacking by out-of-sample predictive accuracy. Stacking is more robust against similar models and prior-sensitive BMA problems (McElreath Ch. 7).
Decision rule
|elpd_diff| > 2·SE → clear difference
|elpd_diff| ≤ 2·SE → practically indistinguishable
k > 0.7 → LOO estimate unreliable, reloo() recommended
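The rule can be written as a small helper. This is a minimal sketch: the function name classify_comparison and its max_k argument are hypothetical, introduced only for illustration, and the thresholds are the ones stated in the rule.

```r
# Hypothetical helper: applies the decision rule to one row of a comparison table.
# elpd_diff, se_diff: values as reported by loo_compare()
# max_k: largest Pareto-k value of the models involved (optional)
classify_comparison <- function(elpd_diff, se_diff, max_k = NA) {
  if (!is.na(max_k) && max_k > 0.7) {
    return("LOO estimate unreliable, reloo() recommended")
  }
  if (abs(elpd_diff) > 2 * se_diff) {
    "clear difference"
  } else {
    "practically indistinguishable"
  }
}

ratio <- abs(-8.7) / 3.4               # the example from the first scenario
round(ratio, 2)
classify_comparison(-8.7, 3.4)
classify_comparison(-1.2, 3.4)
classify_comparison(-8.7, 3.4, max_k = 0.9)
```

The Pareto-k check comes first on purpose: a large difference is not trustworthy if the LOO estimate behind it is unreliable.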
Prerequisites
brms Model Builder (model structure) · Posterior PPC (model diagnostics)