Maximum Likelihood
Likelihood · Log-Likelihood · MLE · Normal · Poisson · Bernoulli
© Dr. Rainer Düsing · Interactive Tools by Claude
The core idea: Maximum Likelihood
We have observed data — they are fixed and unchanging. What we do not know is the underlying parameter (e.g. the true mean μ of the population).

What is a density function? The curve f(y|μ) shows for every possible value y how "densely" the probability mass is concentrated — how typical that value would be given μ. High density at a location means: that value is well compatible with μ.

What is likelihood? It reverses the question: not "how probable is y if μ is known?", but "which μ makes the observed y most plausible?". We slide the imagined distribution across the value range and read off the density value at the observed data point — that is the likelihood L(μ|y). The μ at which this function peaks is called the MLE: the Maximum Likelihood Estimate.
Likelihood ≠ probability: L does not sum to 1 and can exceed 1. Detailed comparison → learn cards below.
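This slide-and-read-off mechanic can be sketched in a few lines of Python (a minimal sketch: the data point y = 1.5 and the μ grid mirror the Stage 1 sliders, and normal_pdf is a hand-rolled helper, not part of the tool):

```python
import math

def normal_pdf(y, mu, sigma=1.0):
    """Density of N(mu, sigma^2) evaluated at y."""
    return math.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

y_obs = 1.5                                  # observed data point: fixed
mus = [m / 100 for m in range(-200, 401)]    # candidate means mu from -2.00 to 4.00

# slide the distribution across the grid, read off the density at y_obs each time
likelihoods = [normal_pdf(y_obs, mu) for mu in mus]

# the likelihood curve peaks exactly where the distribution is centered on the data
mle = mus[likelihoods.index(max(likelihoods))]
```

The argmax of the grid lands at μ = y_obs, just as the Stage 1 panel states for a single data point.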
Running Example
Stage 1 & 2 — Normal
Long jump distance (m)
You measure the long jump performance of sports science students. Values scatter approximately normally around a true mean μ. How large is μ — and which μ makes your concrete measurements most plausible?
Stage 3 — Poisson
Training hours per week
Integer count data (0, 1, 2, …). The Poisson distribution models how many events occur per time period. Parameter λ = expected count per period (the mean). MLE: λ̂ = ȳ — identical principle, different formula.
Stage 3 — Bernoulli
Minimum performance achieved? (0/1)
Binary outcome: passed or failed. Parameter p = success probability. MLE: p̂ = proportion of ones. The likelihood landscape becomes a single-peaked curve over p ∈ (0, 1).
Stage 1
Step 1 — Place data point
Step 2 — Slide distribution → read off likelihood
Step 1: Set data point y (σ = 1, fixed)
Data point y (observed, fixed)
Distribution centered at μ
Density value = Likelihood L(μ|y)
Likelihood curve: how high is f(y|μ) for each μ?
Maximum at μ = — (= y)
Log-Likelihood ℓ(μ|y)
The right curve shows, for every possible μ, how high the density function would be at data point y. The maximum lies exactly at μ = y. That is the MLE for one data point.
σ = 1.0 (fixed)
① Data point y 1.5
② Slide distribution: μ -2.00
Likelihood ≠ Probability — the central distinction
P(y | θ): Probability of data y when parameter θ is known. θ fixed, y random.

L(θ | y): Likelihood of parameter θ when data y are observed. y fixed, θ unknown.

Same mathematical formula — completely different interpretation.

Likelihood does not sum to 1 and is not a probability over θ. For continuous distributions L is the density value — which can exceed 1 (set σ very small in Stage 1).
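That a density value, and hence a likelihood, can exceed 1 is easy to check numerically (the Normal density is written out by hand here; nothing beyond the standard library is assumed):

```python
import math

def normal_pdf(y, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at y."""
    return math.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# at its peak (y == mu) the Normal density equals 1 / (sigma * sqrt(2*pi));
# with sigma = 0.1 that is about 3.99 -- clearly above 1
peak = normal_pdf(0.0, 0.0, 0.1)
```

This is exactly what "set σ very small in Stage 1" demonstrates on screen.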
Why Log-Likelihood? — Stage 1 & 2
The likelihood of n observations is the product of individual densities: L(θ|y₁…yₙ) = ∏ᵢ f(yᵢ|θ)

With many small values (e.g. 0.04 × 0.09 × … × 0.06) the product becomes extremely small — numerical underflow. Stage 2 shows this live: the product of the red lines is displayed in the info text and shrinks rapidly.

The logarithm converts the product into a sum: ℓ(θ) = Σᵢ log f(yᵢ|θ)

Since log is monotonically increasing: same peak, numerically stable. MLE always maximizes ℓ(θ).
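The underflow problem and the log fix can be demonstrated directly (the density values below are invented for illustration):

```python
import math

# 300 invented per-observation density values, all well below 1
densities = [0.04, 0.09, 0.05, 0.07, 0.06] * 60

product = 1.0
for d in densities:
    product *= d                               # raw likelihood: collapses to 0.0 (underflow)

log_lik = sum(math.log(d) for d in densities)  # log-likelihood: a harmless sum near -846
```

Because log is monotone, comparing log_lik values across candidate θ finds the same peak the raw product would, without ever leaving the representable floating-point range.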
MLE — the principle & estimators — Stage 1 & 2
Maximum Likelihood Estimation finds θ that makes the observed data most plausible:

θ̂ = argmax ℓ(θ|y)

This is not a verdict about θ itself — only about its compatibility with the data. Other θ values are not impossible, just less plausible.

Analytical MLE estimators (Normal):
μ̂ = ȳ  ·  σ̂² = Σ(yᵢ−ȳ)²/n  (biased — n not n−1!)

In Stage 1: μ̂ = y (one point); in Stage 2: μ̂ = ȳ (all points).
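The analytical estimators are one-liners; a sketch with invented long-jump data:

```python
data = [5.1, 4.6, 5.4, 4.9, 5.0]   # invented long-jump distances in metres
n = len(data)

mu_hat = sum(data) / n                                  # MLE for mu: the sample mean
sigma2_hat = sum((y - mu_hat) ** 2 for y in data) / n   # MLE for sigma^2: divide by n, not n-1
```

Note the n in the denominator: this is the biased ML variance estimator from the card above, not the usual n−1 sample variance.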
Multidimensional MLE: μ and σ simultaneously — Stage 2
MLE can estimate multiple parameters simultaneously — a key advantage over simple moment estimators.

For the Normal distribution there are two unknown parameters: μ (location) and σ (spread). The log-likelihood becomes a surface over the (μ, σ)-space — a mountain landscape instead of a curve. The joint MLE lies at the peak:

μ̂ = ȳ    σ̂ = √(Σ(yᵢ−ȳ)²/n)

In Stage 2: activate "Vary σ too" → the heatmap shows the 2D landscape. The bright spot is the joint peak. This principle scales to arbitrarily many parameters — glm() in R maximizes a likelihood this way internally, and for lm() the least-squares solution coincides with the Normal MLE.
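The 2D landscape can be explored with a brute-force grid search, a crude stand-in for the heatmap (the data are invented; grid ranges are chosen to bracket the peak):

```python
import math

def log_lik(data, mu, sigma):
    """Normal log-likelihood of the whole sample for given mu and sigma."""
    n = len(data)
    return (-n * math.log(sigma * math.sqrt(2 * math.pi))
            - sum((y - mu) ** 2 for y in data) / (2 * sigma ** 2))

data = [5.1, 4.6, 5.4, 4.9, 5.0]   # invented long-jump distances (m)

# scan a (mu, sigma) grid and keep the highest point of the "mountain landscape"
grid = ((m / 100, s / 100) for m in range(400, 601) for s in range(10, 101))
mu_best, sigma_best = max(grid, key=lambda p: log_lik(data, *p))

# analytical joint MLE for comparison
mu_hat = sum(data) / len(data)
sigma_hat = math.sqrt(sum((y - mu_hat) ** 2 for y in data) / len(data))
```

The grid peak agrees with the analytical joint MLE up to the grid resolution of 0.01.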
MLE is general — Poisson & Bernoulli — Stage 3
MLE works for any parametric family — not just Normal. This is the foundation of GLMs:

Poisson (training hours/week): λ̂ = ȳ  ·  Log-likelihood: Σᵢ [yᵢ·log(λ) − λ]  (up to an additive constant)
Bernoulli (min. performance 0/1): p̂ = proportion of ones  ·  Log-likelihood: Σᵢ [yᵢ·log(p) + (1−yᵢ)·log(1−p)]

Switch between families in Stage 3: the likelihood landscape changes its shape — parabola for Bernoulli, asymmetric for Poisson — but the mechanism is identical: find the peak.
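Both estimators, and the fact that they really sit at the peaks of their log-likelihoods, fit in a short sketch (both data sets are invented):

```python
import math

def pois_ll(lam, ys):
    """Poisson log-likelihood, constant log(y!) term dropped."""
    return sum(y * math.log(lam) - lam for y in ys)

def bern_ll(p, ys):
    """Bernoulli log-likelihood."""
    return sum(y * math.log(p) + (1 - y) * math.log(1 - p) for y in ys)

hours = [3, 5, 4, 2, 6, 4]          # invented training hours per week
lam_hat = sum(hours) / len(hours)   # Poisson MLE: the sample mean

passed = [1, 0, 1, 1, 0, 1, 1, 0]   # invented 0/1 outcomes
p_hat = sum(passed) / len(passed)   # Bernoulli MLE: proportion of ones
```

Evaluating pois_ll and bern_ll at values left and right of the estimates confirms that each closed-form estimator is the peak of its own landscape.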
Model comparison with AIC & BIC — Stage 3
The maximized log-likelihood value ℓ̂ can be used directly for model comparison:

AIC = −2·ℓ̂ + 2k  (k = number of parameters)
BIC = −2·ℓ̂ + k·log(n)

Smaller AIC/BIC = better trade-off between fit and complexity. AIC penalizes additional parameters less strongly than BIC (once n ≥ 8, so that log(n) > 2) — with large n, BIC favors more parsimonious models.

Application example: Does Poisson or Negative-Binomial fit the training-hours data better? Same data, different families — AIC/BIC decide. In Stage 3, AIC and BIC are computed live.
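Computing AIC and BIC from a maximized log-likelihood takes two lines; sketched here for a Poisson fit to invented count data (the log(y!) constant is kept via lgamma so that ℓ̂ stays comparable across families):

```python
import math

hours = [3, 5, 4, 2, 6, 4]          # invented training hours per week
n = len(hours)
lam_hat = sum(hours) / n            # Poisson MLE

# maximized Poisson log-likelihood, log(y!) included via lgamma(y + 1)
ll_hat = sum(y * math.log(lam_hat) - lam_hat - math.lgamma(y + 1) for y in hours)

k = 1                               # Poisson has a single parameter
aic = -2 * ll_hat + 2 * k
bic = -2 * ll_hat + k * math.log(n)
```

Fitting a second family (e.g. a Negative Binomial with k = 2) to the same data and comparing its aic/bic against these values decides between the models.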
ℹ Maximum Likelihood — Help
What will I learn here?
This tool explains Maximum Likelihood Estimation (MLE) — one of the most important estimation mechanisms in statistics. And it makes a crucial distinction clear: likelihood is not probability.
The three stages
Likelihood ≠ Probability
Probability: Parameters fixed → how probable are these data?
Likelihood: Data fixed, observed → how plausible is this parameter?

Likelihood is not a probability distribution over the parameter — it does not integrate to 1. Only a prior turns it into a posterior (Bayes' theorem).
Why this matters for Bayes
MLE delivers the most plausible parameter without prior information. Bayesian estimation weights this likelihood with a prior: Posterior ∝ Likelihood × Prior. With a flat prior → posterior mode ≈ MLE. This makes MLE the conceptual foundation for everything that follows.
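The flat-prior claim can be checked on a grid; a sketch for Bernoulli data with 5 ones out of 8 trials (all values invented):

```python
# invented Bernoulli data: 5 successes out of 8 trials
successes, n = 5, 8

grid = [i / 1000 for i in range(1, 1000)]    # candidate p values in (0, 1)

def likelihood(p):
    return p ** successes * (1 - p) ** (n - successes)

flat_prior = 1.0                                         # constant over the grid
posterior = [likelihood(p) * flat_prior for p in grid]   # proportional only; normalization
                                                         # is irrelevant for locating the mode
post_mode = grid[posterior.index(max(posterior))]
```

With the flat prior, the posterior mode lands at p̂ = 5/8 = 0.625, exactly the MLE, illustrating "flat prior → posterior mode ≈ MLE".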
Next → From LM to GLM: link functions and why GLMs apply the same MLE logic to other distributions