Data Creator – Bayes Thinking Lab

Data Creator — Help & Documentation

What is the Data Creator? ▸

The Data Creator generates parametric example datasets for teaching and simulation purposes. You specify the design (between and/or within factors, likelihood, cluster structure) and enter theoretical means directly — the code handles data generation.

Output: CSV export (wide & long), complete R code (with faux::sim_design, glmmTMB, power simulation), and an interactive preview.

The Data Creator is not an analysis tool — it supports study design, power analysis, and prior predictive checking.

Between vs. Within Factors ▸

Between-subjects factor: Each participant belongs to exactly one group (e.g. control vs. experimental). Variance arises between persons.

Within-subjects factor: The same person is measured repeatedly (e.g. time points t1, t2, t3). Repeated measurements create correlation between levels. The Data Creator currently supports one within factor.

The combination of both is called a mixed design (= split-plot). In brms: bf(Y ~ group * time + (1 | id))

Level names: Enter the levels as a comma-separated list, e.g. A, B, C. Spaces within a name are automatically converted to underscores.

Likelihood & Model Scale (η) ▸

All means are entered on the model scale η (linear predictor):

Gaussian: η = E[Y] directly (identity link)
Poisson / NegBin: η = log(E[Y]) → E[Y] = exp(η)
Bernoulli / Beta / Binomial: η = logit(p) → p = 1/(1+exp(−η))
Log-Normal: η = E[log(Y)] → Median = exp(η)
Gamma: η = log(E[Y]), φ = shape (variability)
Student-t: η = E[Y], ν = degrees of freedom (smaller ν = heavier tails)

The ⇄ Response scale button allows input on the response scale E[Y] — internally everything is converted to η.

The Scale Converter (below the likelihood selector) shows the η ↔ E[Y] conversion interactively for any value.
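
The conversions listed above can be reproduced in base R. This is a minimal sketch, not the tool's own code: the helper names to_eta() and to_response() are hypothetical, and only the identity, log, and logit links from the list are covered.

```r
# Convert between the response scale E[Y] and the model scale eta.
# to_eta() and to_response() are hypothetical helper names.
to_eta <- function(mean_y, link = c("identity", "log", "logit")) {
  link <- match.arg(link)
  switch(link,
    identity = mean_y,
    log      = log(mean_y),
    logit    = qlogis(mean_y)   # log(p / (1 - p))
  )
}

to_response <- function(eta, link = c("identity", "log", "logit")) {
  link <- match.arg(link)
  switch(link,
    identity = eta,
    log      = exp(eta),
    logit    = plogis(eta)      # 1 / (1 + exp(-eta))
  )
}

eta_pois <- to_eta(5, "log")        # Poisson mean of 5 events
eta_bern <- to_eta(0.25, "logit")   # Bernoulli probability of 0.25
back     <- to_response(eta_bern, "logit")
```

Converting to η and back should return the original value, which is a quick sanity check when entering means on the response scale.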

Mean Grid & Standard Deviations ▸

The mean grid shows all design cells. A 2×3 design (2 between levels × 3 within levels) yields 6 cells.
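
As an illustration of how the cells of such a grid can be enumerated in R, the sketch below builds the 2×3 example with expand.grid(). The level names and η values are made up for the example.

```r
# Enumerate all cells of a 2 (between) x 3 (within) design.
# Level names and eta values are illustrative only.
between_levels <- c("control", "experimental")
within_levels  <- c("t1", "t2", "t3")

cells <- expand.grid(
  group = between_levels,
  time  = within_levels,
  stringsAsFactors = FALSE
)

# expand.grid() varies the first factor fastest, so the order is
# control/t1, experimental/t1, control/t2, experimental/t2, ...
cells$eta <- c(0.0, 0.0, 0.2, 0.5, 0.3, 0.9)

n_cells <- nrow(cells)
```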

The SD row sets the residual spread per cell on the model scale. For the Gaussian likelihood this is the residual σ of Y itself.

For non-Gaussian likelihoods, σ does not directly control the variance of the target distribution. It sets the variance of the normally distributed η values, which are then transformed into the target distribution.
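
To make the role of σ for a non-Gaussian likelihood concrete, here is a hedged base R sketch for a single Poisson cell. It assumes the two-step scheme described above (draw η from a normal distribution, then draw Y from the target distribution). The parameter values are arbitrary.

```r
# One Poisson cell: sigma acts on eta, not directly on Y.
set.seed(1)
n      <- 1e5
mu_eta <- log(5)   # cell mean on the model scale
sigma  <- 0.4      # SD on the model scale

eta <- rnorm(n, mean = mu_eta, sd = sigma)  # normal draws at the eta level
y   <- rpois(n, lambda = exp(eta))          # transform to the target distribution

mean_y   <- mean(y)
var_y    <- var(y)
expected <- exp(mu_eta + sigma^2 / 2)       # mean of a log-normal rate
```

With σ > 0 the counts are overdispersed (variance above the mean), and the average of Y sits near exp(η + σ²/2) rather than exp(η).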

Correlation Matrix (Within Design) ▸

When a within factor is active, correlated repeated measurements arise. The correlation matrix (tab 2) determines how strongly the time points are related.

Compound Symmetry (CS): All off-diagonal entries equal ρ (simplest model)
AR(1): ρ^|j-k| — adjacent time points correlate more strongly (typical for longitudinal data)
Custom: Free entry of each matrix element (must be positive definite)

The red border in Custom mode indicates the matrix is not positive definite and does not yield a valid covariance matrix.
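
The two built-in structures and the positive-definiteness check can be sketched in base R as follows. The function names are hypothetical, and the eigenvalue tolerance is an assumption rather than the tool's actual threshold.

```r
# Compound symmetry: every off-diagonal entry equals rho.
make_cs <- function(k, rho) {
  m <- matrix(rho, nrow = k, ncol = k)
  diag(m) <- 1
  m
}

# AR(1): entry (j, k) equals rho^|j - k|.
make_ar1 <- function(k, rho) {
  rho^abs(outer(seq_len(k), seq_len(k), "-"))
}

# A correlation matrix is valid only if all eigenvalues are positive.
is_positive_definite <- function(m, tol = 1e-8) {
  isSymmetric(m) &&
    all(eigen(m, symmetric = TRUE, only.values = TRUE)$values > tol)
}

cs  <- make_cs(3, 0.5)
ar1 <- make_ar1(3, 0.5)

# A custom matrix that looks plausible but is not positive definite.
bad <- matrix(c( 1.0, 0.9, -0.9,
                 0.9, 1.0,  0.9,
                -0.9, 0.9,  1.0), nrow = 3, byrow = TRUE)
```

The last matrix shows why free entry needs a check: t1 and t3 cannot correlate at −0.9 when both correlate at 0.9 with t2.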

For non-Gaussian likelihoods: The correlation structure acts as a Gaussian approximation at the η level — acceptable for moderate correlations.

Cluster / Random Effects (τ₀, τ₁, ρ₀₁) ▸

The cluster toggle adds a grouping structure (e.g. students in classes). Each class receives a random intercept u₀ⱼ ~ N(0, τ₀²).

τ₀ (random intercept SD): How much do classes differ in their mean level? Larger τ₀ = more heterogeneity between clusters.

τ₁ (random slope SD): If τ₁ > 0, each class also gets a random slope for the chosen predictor. Formula: (1 + method | class)

ρ₀₁ (intercept–slope correlation): Correlation between u₀ and u₁. Positive means: classes with a high baseline also show stronger effects. Generated via Cholesky decomposition.
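
A minimal sketch of generating correlated random intercepts and slopes via Cholesky decomposition in base R. The parameter values are arbitrary, and the large number of clusters is chosen only so the sample correlation is stable.

```r
# Correlated random effects (u0, u1) via Cholesky decomposition.
set.seed(42)
n_clusters <- 2000
tau0  <- 0.5   # random intercept SD
tau1  <- 0.3   # random slope SD
rho01 <- 0.4   # intercept-slope correlation

Sigma <- matrix(c(tau0^2,              rho01 * tau0 * tau1,
                  rho01 * tau0 * tau1, tau1^2), nrow = 2)

# chol() returns the upper factor R with t(R) %*% R == Sigma,
# so L = t(R) is the lower-triangular Cholesky factor.
L <- t(chol(Sigma))

z <- matrix(rnorm(2 * n_clusters), nrow = 2)  # independent standard normals
u <- t(L %*% z)                               # one row per cluster
colnames(u) <- c("u0", "u1")

r_hat <- cor(u[, "u0"], u[, "u1"])
```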

Slope on: Choose which predictor receives the random slope — between or within factor possible. This determines the brms formula.

ICC: Intraclass correlation = τ₀²/(τ₀²+σ²). Typical values: schools ~0.10–0.20, families ~0.15–0.45.
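
The ICC formula can be checked against simulated data. The sketch below uses a Gaussian outcome with τ₀ = 0.5 and σ = 1. The moment-based estimator is a simple stand-in for illustration, not what the tool or a mixed model would report.

```r
# Theoretical ICC for a random-intercept model.
icc <- function(tau0, sigma) tau0^2 / (tau0^2 + sigma^2)
icc_theory <- icc(0.5, 1)

# Simulate clustered Gaussian data and recover the ICC from moments.
set.seed(7)
n_clusters <- 500
n_per      <- 20
cluster    <- rep(seq_len(n_clusters), each = n_per)

u0 <- rnorm(n_clusters, mean = 0, sd = 0.5)            # random intercepts
y  <- u0[cluster] + rnorm(n_clusters * n_per, 0, 1)    # add residual noise

within_var  <- mean(tapply(y, cluster, var))
# The variance of cluster means contains sigma^2 / n_per, so subtract it.
between_var <- var(tapply(y, cluster, mean)) - within_var / n_per
icc_hat     <- between_var / (between_var + within_var)
```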

Covariates (Continuous Predictors) ▸

Covariates are continuous predictors generated in addition to between/within factors (max. 2). They appear as columns in the CSV and in the formula preview.

μ / σ: Global mean and standard deviation of the covariate.

μ per group: If between factors are present, the mean can vary by group — creating confounding between covariate and factor (typical use case for G-computation).

ρ with cov1: If 2 covariates are defined, they can be generated correlated via Cholesky decomposition. The R code uses MASS::mvrnorm().

Covariates are always between-subjects (time-invariant): in within designs, each subject's single covariate value is automatically repeated on every row by pivot_longer().
Covariate names are sanitized in R code via cleanR() (spaces → underscores, special chars removed).

Formula & Model Specification ▸

The formula preview (tab 1) shows lme4- and brms-compatible formulas. With multiple factors, the full interaction model (A * B * W) is automatically suggested — the most robust model for repeated measures.

A * B expands to A + B + A:B (main effects + interaction). A * B * W yields all three main effects, all two-way interactions, and the three-way interaction.
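
The expansion can be inspected directly in base R with terms(), which needs no data. The factor names A, B, and W are placeholders.

```r
# List the terms that the * operator expands to.
two_way   <- attr(terms(Y ~ A * B), "term.labels")
three_way <- attr(terms(Y ~ A * B * W), "term.labels")

print(two_way)
print(three_way)
```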

The random effect term (1 | class) appears automatically when cluster is active. With random slope: (1 + time | class).

Visualization & Plot Options ▸

Raw data / Means: Shows individual points (subsampled), boxplots (between) or mean lines (within) + spaghetti lines per cluster.

Effect plot: Shows median (50th pct.) + band from 16th to 84th pct. (~±1 SD for normal distribution). Cleaner, but without raw data.

Group selector: With multiple between factors, choose which factor determines colors/lines. "All combinations" shows all cells simultaneously.

X-axis selector (between-only, ≥2 between factors): Which factor goes on the X axis? The group selector then determines the color grouping.

R Code Export & faux::sim_design() ▸

The generated R code is fully reproducible and uses:

faux::sim_design() (DeBruine & Barr 2021) for within designs: generates correlated, normally distributed data with the specified covariance structure
base R (rnorm, rbinom, …) for simple between-/no-factor designs
MASS::mvrnorm() for correlated random effects (τ₀, τ₁) and correlated covariates

Non-Gaussian likelihoods: sim_design() always generates normally distributed η values. These are then transformed via mutate(across(...)) into the target distribution (e.g. rbinom(n(), 1, plogis(eta)) for Bernoulli). This is a Gaussian approximation at the η level — well suited for moderate correlations and not too extreme distribution parameters.

Wide → Long: Within designs produce dat_wide (one column per time point). pivot_longer() converts it to dat_long — the format expected by brms/lme4. Covariates (between-subjects) are automatically replicated.
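
The reshaping step can be illustrated without extra packages. The sketch below uses base R's reshape() as a stand-in for pivot_longer(). The data values and the covariate name age are made up.

```r
# Wide format: one row per subject, one column per time point.
dat_wide <- data.frame(
  id    = 1:4,
  group = c("ctrl", "ctrl", "exp", "exp"),
  age   = c(21, 34, 28, 45),          # between-subjects covariate
  t1    = c(4.1, 5.0, 4.7, 5.2),
  t2    = c(4.6, 5.3, 5.9, 6.1),
  t3    = c(4.8, 5.1, 6.8, 7.0)
)

# Long format: one row per subject and time point.
dat_long <- reshape(
  dat_wide,
  direction = "long",
  varying   = c("t1", "t2", "t3"),
  v.names   = "Y",
  timevar   = "time",
  times     = c("t1", "t2", "t3"),
  idvar     = "id"
)
dat_long <- dat_long[order(dat_long$id, dat_long$time), ]
rownames(dat_long) <- NULL
```

Each subject now has three rows, and the covariate value is repeated on each of them.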

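empirical = TRUE: Set this when the sample statistics should equal the specified parameters exactly instead of varying from draw to draw. This is useful when the JS visualization and the R code should show the same means, SDs, and correlations.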
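
The idea behind exact sample statistics can be shown for a single variable in base R. This is an illustration of the concept only, not how faux implements empirical = TRUE. The target values are arbitrary.

```r
# Random draw: sample mean and SD differ from the targets by chance.
set.seed(3)
target_mean <- 100
target_sd   <- 15
x <- rnorm(30, mean = target_mean, sd = target_sd)

# Rescale the same draw so its sample mean and SD hit the targets exactly.
x_emp <- target_mean + target_sd * (x - mean(x)) / sd(x)

random_stats    <- c(mean = mean(x),     sd = sd(x))
empirical_stats <- c(mean = mean(x_emp), sd = sd(x_emp))
```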

Package installation:

install.packages("faux")
install.packages("glmmTMB")
install.packages("car") # omnibus tests in the power block
install.packages("MASS") # usually pre-installed
install.packages("future.apply") # for power parallelization

Power Simulation (Commented Block) ▸

At the end of each generated R code there is a complete, commented-out power simulation block. Remove the # characters and the block runs directly in R.

Block structure:

library(glmmTMB) + library(car) — packages for GLMMs and omnibus tests
N_sim <- 500 — simulation replications (500–1000 for stable estimates)
sim_data_one(seed) — generates a dataset with your parameters; within designs include (1|id)
fit_and_test(seed) — fits full model, returns named logical vector (one value per term)
sapply() → matrix (rows = terms, cols = sims); rowMeans() = power per term
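
The structure listed above can be sketched in base R. To stay free of dependencies, this version swaps glmmTMB and car::Anova for lm() and anova() on a balanced 2×2 between design. The effect sizes, cell size, and factor names are assumptions for the example.

```r
N_sim      <- 200   # kept small so the sketch runs quickly
n_per_cell <- 20

# Generate one dataset: factor A has a 1-SD effect, B and A:B have none.
sim_data_one <- function(seed) {
  set.seed(seed)
  dat <- expand.grid(rep = seq_len(n_per_cell),
                     A = c("a1", "a2"), B = c("b1", "b2"))
  dat$A <- factor(dat$A)
  dat$B <- factor(dat$B)
  dat$Y <- rnorm(nrow(dat), mean = ifelse(dat$A == "a2", 1, 0), sd = 1)
  dat
}

# Fit the full model and return one logical per term (significant or not).
fit_and_test <- function(seed) {
  dat <- sim_data_one(seed)
  tab <- anova(lm(Y ~ A * B, data = dat))
  p   <- tab[c("A", "B", "A:B"), "Pr(>F)"]
  setNames(p < 0.05, c("A", "B", "A:B"))
}

# Matrix with rows = terms and columns = simulations.
results <- sapply(seq_len(N_sim), fit_and_test)
power   <- rowMeans(results)
```

For terms with no true effect, the estimated "power" is the false-positive rate and should sit near the 5% significance level.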

Two test variants in the block:

Option A (default): car::Anova(type="II") — Wald omnibus tests for all terms in one model fit. Fast, one fit per sim. Reports power per term. When interactions are present and main effects should be interpreted conditionally: options(contrasts = c("contr.sum", "contr.poly")) + type="III" (see comment in code).
Option B (at end, commented): LRT via anova() — likelihood ratio tests via model comparison. Statistically more accurate (esp. small N and non-Gaussian), but ~3–6× slower. Includes:
- Interaction test: full model vs. additive
- Type-II main effects: additive vs. additive-without-X (controlling for all others)
- Each factor vs. null model (Y ~ 1 + RE)
- Additive model vs. null (all main effects combined)

Within design and ID variable: For within factors, the power formula always includes (1 | id) to correctly model repeated measurements. The id comes from faux::sim_design() and survives the pivot_longer() step.

glmmTMB families — differences from brms/lme4:

Student-t: t_family() — not gaussian()!
Log-Normal: lognormal() — not gaussian(link="log")!
Gamma: Gamma(link = "log")
Beta: beta_family(link = "logit")
Neg. Binomial: nbinom2(link = "log") (NB2 parameterization)

Parallelization: library(future.apply); plan(multisession) → future_sapply() instead of sapply(). Speedup ~3–6× depending on machine.

Interpretation: A power of 80% means that, given your chosen parameters, about 80 out of 100 replications of the experiment are expected to detect the effect as significant.

Frequently Asked Questions & Pitfalls ▸

Why don't JS simulation and R code match exactly?
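JS uses mulberry32 as its PRNG, while R uses the Mersenne Twister. The same seed therefore produces different random numbers in each, so the two datasets agree in distribution but not value by value. For matching summary statistics, set empirical = TRUE in faux::sim_design(): the sample means, SDs, and correlations of the normally distributed values then equal the specified values exactly, although individual data points still differ.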

Why do factor names appear differently in R code?
Names with spaces or special characters (e.g. "Reaction Time (ms)") are automatically sanitized: spaces/hyphens → underscores, remaining special characters removed. The sanitized names are also used in CSV headers and the formula preview.
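
The sanitizing rule described here could look roughly like the following in R. The function clean_name() is a hypothetical reimplementation based only on this description. The tool's actual cleanR() may differ in details such as how repeated separators are handled.

```r
# Hypothetical sketch of the sanitizing rule:
# 1) runs of spaces or hyphens become one underscore,
# 2) any remaining character that is not a letter, digit, or underscore is dropped.
clean_name <- function(x) {
  x <- gsub("[ -]+", "_", x)
  gsub("[^A-Za-z0-9_]", "", x)
}

cleaned <- clean_name(c("Reaction Time (ms)", "pre-test score"))
```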

Covariates in within designs (Wide → Long):
Covariates are between-subjects and therefore appear as one column in wide format. After pivot_longer(), the covariate is automatically replicated for each measurement — this is the correct long format for brms/lme4.

Non-Gaussian likelihoods and correlation matrix:
faux::sim_design() always generates normally distributed η values internally (linear predictor). The correlation structure (tab 2) acts at this η level. After transformation to the target distribution (e.g. Poisson, Beta), the Pearson correlation between within levels is typically weaker than ρ. For strongly non-normal distributions (e.g. Bernoulli with extreme p), the difference can be substantial.
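
The attenuation can be seen in a small base R simulation. This sketch assumes two within levels, a Bernoulli outcome, and ρ = 0.6 at the η level. It generates the correlated η values with a Cholesky factor rather than faux.

```r
set.seed(11)
n   <- 20000
rho <- 0.6
R   <- matrix(c(1, rho, rho, 1), nrow = 2)

# Correlated standard-normal eta values for two time points.
# chol(R) is upper triangular with t(U) %*% U == R, so Z %*% U has correlation R.
eta <- matrix(rnorm(2 * n), ncol = 2) %*% chol(R)

# Transform each time point into a Bernoulli outcome.
y1 <- rbinom(n, size = 1, prob = plogis(eta[, 1]))
y2 <- rbinom(n, size = 1, prob = plogis(eta[, 2]))

cor_eta <- cor(eta[, 1], eta[, 2])   # close to rho
cor_y   <- cor(y1, y2)               # clearly smaller
```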

Gamma parameterization:
The Data Creator uses Gamma(μ, φ) with log link: μ = exp(η), shape = φ, rate = φ/μ. This matches brms' Gamma(link="log") parameterization. Variance = μ²/φ — larger φ means smaller relative variability.
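
A quick base R check of this parameterization, using arbitrary values μ = 10 and φ = 4:

```r
set.seed(5)
eta <- log(10)   # model scale
phi <- 4         # shape
mu  <- exp(eta)  # response-scale mean

y <- rgamma(1e5, shape = phi, rate = phi / mu)

mean_y      <- mean(y)      # should be near mu = 10
var_y       <- var(y)       # should be near mu^2 / phi = 25
theory_var  <- mu^2 / phi
```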

Beta parameterization:
Beta(μ, φ) with μ = logit⁻¹(η), α = μ·φ, β = (1−μ)·φ. φ is the precision parameter: larger φ = narrower distribution around μ. Typical range for φ with Likert data: 5–20.
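
The same kind of check works for the Beta parameterization. The values μ = 0.7 and φ = 10 are arbitrary. The variance formula μ(1−μ)/(1+φ) follows from the standard Beta variance with α = μ·φ and β = (1−μ)·φ.

```r
set.seed(9)
eta <- qlogis(0.7)   # model scale
phi <- 10            # precision
mu  <- plogis(eta)   # response-scale mean

alpha <- mu * phi
beta  <- (1 - mu) * phi
y <- rbeta(1e5, shape1 = alpha, shape2 = beta)

mean_y     <- mean(y)                     # should be near mu = 0.7
var_y      <- var(y)
theory_var <- mu * (1 - mu) / (1 + phi)   # shrinks as phi grows
```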

Computing ICC:
ICC = τ₀² / (τ₀² + σ²). For Gaussian design: τ₀ = 0.5, σ = 1 → ICC ≈ 0.20. Typical benchmarks: schools ~0.10–0.20, families ~0.20–0.40, measurements within persons ~0.50–0.70.
