Maximum Likelihood — Bayes Thinking Lab

Maximum Likelihood — which parameter makes the observed data most plausible? · Likelihood ≠ Probability

Stage 1

Step 1: Place data point

Step 2: Slide distribution → read off likelihood

Step 1: Set data point y (σ = 1, fixed)

[Interactive plot. Readouts: data point y (observed, fixed); distribution centered at μ; density value at y = likelihood L(μ|y); log-likelihood ℓ(μ|y). The likelihood curve shows how high f(y|μ) is for each μ, with its maximum at μ = y.]

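The right-hand curve shows, for every possible μ, how high the density function would be at the data point y. The maximum is exactly at μ = y: the distribution centered on the observation makes that observation most plausible. That is the MLE for a single data point.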
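The sliding step can be reproduced numerically. The following is a minimal sketch, assuming a Normal density with σ = 1 and the slider defaults y = 1.5 and μ = −2.0; the helper name `normal_pdf` and the grid resolution are illustrative choices, not part of the lab itself.

```python
import math

def normal_pdf(y, mu, sigma=1.0):
    """Density of Normal(mu, sigma) evaluated at y."""
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

y = 1.5  # observed data point, held fixed

# "Slide" mu from -4 to 4 in steps of 0.01 and record the density at y each time.
mu_grid = [i / 100 for i in range(-400, 401)]
likelihood = [normal_pdf(y, mu) for mu in mu_grid]

# The grid position with the highest density at y is the (grid) MLE.
best_index = max(range(len(mu_grid)), key=lambda i: likelihood[i])
mu_hat = mu_grid[best_index]

print(f"L(mu=-2.0 | y) = {normal_pdf(y, -2.0):.6f}")
print(f"L(mu= 1.5 | y) = {normal_pdf(y, 1.5):.6f}")
print(f"grid maximum at mu = {mu_hat}")
```

The grid maximum lands on μ = y because the Normal density is highest at its own center.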
        

Controls: σ = 1.0 (fixed) · ① Data point y = 1.5 · ② Slide distribution: μ = −2.00

Likelihood ≠ Probability — the central distinction

P(y | θ): Probability of data y when parameter θ is known. θ fixed, y unknown.

L(θ | y): Likelihood of parameter θ when data y are observed. y fixed, θ unknown.

Same mathematical formula — completely different interpretation.

Likelihood need not sum or integrate to 1 over θ and is not a probability distribution over θ. For continuous distributions L is the density value, which can exceed 1 (set σ very small in Stage 2 to see the density curve rise above 1; for the Normal this happens once σ < 1/√(2π) ≈ 0.40).
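Both claims can be checked with a few lines of Python. This is a sketch under stated assumptions: the σ values are arbitrary, and the Bernoulli case with a single observed success (y = 1) is an example chosen here because its likelihood, L(p|y) = p, is easy to integrate.

```python
import math

def normal_pdf(y, mu, sigma):
    """Density of Normal(mu, sigma) evaluated at y."""
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

# Claim 1: a density value (and hence a likelihood) can exceed 1.
# The peak of a Normal density is 1 / (sigma * sqrt(2*pi)).
for sigma in (1.0, 0.3, 0.1):
    print(f"sigma = {sigma}: peak density = {normal_pdf(0.0, 0.0, sigma):.4f}")

# Claim 2: the likelihood is not a probability distribution over the parameter.
def bernoulli_likelihood(p, y):
    """Likelihood of success probability p for one 0/1 observation y."""
    return p ** y * (1.0 - p) ** (1 - y)

# Midpoint-rule integral of L(p | y = 1) over p in [0, 1].
n_steps = 1000
area = sum(bernoulli_likelihood((i + 0.5) / n_steps, 1) for i in range(n_steps)) / n_steps
print(f"integral of L(p | y=1) over p = {area:.4f}")
```

The integral comes out at one half rather than 1, which is why L(θ|y) should be read as a plausibility score for θ and not as a probability.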

Why Log-Likelihood? — Stage 1 & 2

The likelihood of n observations is the product of individual densities: L(θ|y₁…yₙ) = ∏ᵢ f(yᵢ|θ)

With many small values (e.g. 0.04 × 0.09 × … × 0.06) the product becomes extremely small — numerical underflow. Stage 2 shows this live: the product of the red lines is displayed in the info text and shrinks rapidly.

The logarithm converts the product into a sum: ℓ(θ) = Σᵢ log f(yᵢ|θ)

Why does this work? The logarithm is a monotonically increasing function — whenever L(θ) gets larger, log L(θ) also gets larger, and wherever L(θ) reaches its maximum, log L(θ) reaches its maximum at exactly the same point θ̂. It does not matter whether you maximize L(θ) or ℓ(θ) — the answer is always identical.

The key practical advantage: instead of multiplying thousands of tiny numbers together (which causes numerical underflow, where the computer simply returns 0), you add negative numbers of manageable size. The sum stays well within floating-point range, so the underflow problem disappears; ordinary floating-point rounding still applies, but it is harmless here.
In practice, MLE therefore maximizes ℓ(θ) = log L(θ).
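The underflow argument and the "same maximum" argument can both be seen in a short sketch. The 2000 density values of 0.05 and the four data points are hypothetical numbers picked for illustration; σ is assumed to be 1.

```python
import math

def normal_pdf(y, mu, sigma=1.0):
    """Density of Normal(mu, sigma) evaluated at y."""
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

# Part 1: the product of many small densities underflows, the sum of logs does not.
densities = [0.05] * 2000  # hypothetical density values
product = 1.0
for d in densities:
    product *= d
log_sum = sum(math.log(d) for d in densities)
print("product of densities:", product)
print("sum of log-densities:", log_sum)

# Part 2: for a small sample, L and log L peak at the same mu.
data = [1.2, 0.7, 2.1, 1.6]  # hypothetical observations
mu_grid = [i / 100 for i in range(0, 301)]  # candidate mu from 0.00 to 3.00

def likelihood(mu):
    value = 1.0
    for y in data:
        value *= normal_pdf(y, mu)
    return value

def log_likelihood(mu):
    return sum(math.log(normal_pdf(y, mu)) for y in data)

mu_hat_from_L = max(mu_grid, key=likelihood)
mu_hat_from_ell = max(mu_grid, key=log_likelihood)
print("argmax of L:    ", mu_hat_from_L)
print("argmax of log L:", mu_hat_from_ell)
```

With only four observations the raw product is still representable, so both searches can be compared directly; with thousands of observations only the log version remains usable.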

MLE — the principle & estimators — Stage 1 & 2

Maximum Likelihood Estimation finds θ that makes the observed data most plausible:

θ̂ = argmax ℓ(θ|y)

This is not a verdict about θ itself — only about its compatibility with the data. Other θ values are not impossible, just less plausible.

Analytical MLE estimators (Normal):
μ̂ = ȳ
σ̂² = Σ(yᵢ−ȳ)²/n (biased: divides by n, not n−1)

In Stage 1: μ̂ = y (one point); in Stage 2: μ̂ = ȳ (all points).
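The closed-form estimators are easy to compute by hand. The sketch below uses a hypothetical five-point sample and cross-checks the result against Python's `statistics` module, where `pvariance` divides by n and `variance` divides by n−1.

```python
import statistics

y = [1.2, 0.7, 2.1, 1.6, 0.9]  # hypothetical sample
n = len(y)

mu_hat = sum(y) / n                                # MLE of mu: the sample mean
sum_sq = sum((yi - mu_hat) ** 2 for yi in y)       # sum of squared deviations
sigma2_mle = sum_sq / n                            # MLE of sigma^2 (biased)
sigma2_unbiased = sum_sq / (n - 1)                 # usual sample variance

print(f"mu_hat          = {mu_hat:.4f}")
print(f"sigma^2 (MLE)   = {sigma2_mle:.4f}")
print(f"sigma^2 (n - 1) = {sigma2_unbiased:.4f}")

# Cross-check with the standard library.
print(f"statistics.pvariance = {statistics.pvariance(y):.4f}")
print(f"statistics.variance  = {statistics.variance(y):.4f}")
```

The MLE of σ² is always the smaller of the two values, by the factor (n−1)/n; the gap closes as n grows.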

Multidimensional MLE: μ and σ simultaneously — Stage 2

MLE can estimate multiple parameters simultaneously: all parameters are chosen jointly so that a single log-likelihood is maximized.

For the Normal distribution there are two unknown parameters: μ (location) and σ (spread). The log-likelihood becomes a surface over the (μ, σ)-space — a mountain landscape instead of a curve. The joint MLE lies at the peak:

μ̂ = ȳ
σ̂ = √(Σ(yᵢ−ȳ)²/n)

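In Stage 2: activate "Vary σ too" → the heatmap shows the 2D landscape. The bright spot is the joint peak. This principle scales to arbitrarily many parameters. In R, glm() maximizes the likelihood numerically (by iteratively reweighted least squares), and the least-squares fit of lm() coincides with the MLE of the coefficients under Normal errors.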
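The two-dimensional landscape can be sketched as a brute-force grid search over (μ, σ). The sample, the grid ranges, and the 0.01 step are assumptions made for this illustration; real software would use a numerical optimizer instead of a grid.

```python
import math

def normal_loglik(data, mu, sigma):
    """Log-likelihood of Normal(mu, sigma) for the whole sample."""
    n = len(data)
    sum_sq = sum((yi - mu) ** 2 for yi in data)
    return -n * math.log(sigma) - 0.5 * n * math.log(2.0 * math.pi) - sum_sq / (2.0 * sigma ** 2)

data = [1.2, 0.7, 2.1, 1.6, 0.9]  # hypothetical sample

mu_grid = [i / 100 for i in range(0, 301)]      # mu from 0.00 to 3.00
sigma_grid = [i / 100 for i in range(10, 201)]  # sigma from 0.10 to 2.00

# Evaluate the surface at every grid cell and keep the highest point.
mu_best, sigma_best = max(
    ((mu, sigma) for mu in mu_grid for sigma in sigma_grid),
    key=lambda pair: normal_loglik(data, pair[0], pair[1]),
)

# Closed-form joint MLE for comparison.
n = len(data)
mu_closed = sum(data) / n
sigma_closed = math.sqrt(sum((yi - mu_closed) ** 2 for yi in data) / n)

print(f"grid peak:   mu = {mu_best:.2f}, sigma = {sigma_best:.2f}")
print(f"closed form: mu = {mu_closed:.4f}, sigma = {sigma_closed:.4f}")
```

The grid peak agrees with the closed-form solution up to the grid resolution, which is what the bright spot on the heatmap represents.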

MLE is general — Poisson & Bernoulli — Stage 3

MLE works for any parametric family — not just Normal. This is the foundation of GLMs:

Poisson (training hours/week): λ̂ = ȳ · Log-likelihood: Σᵢ [yᵢ·log(λ) − λ − log(yᵢ!)] (the last term does not depend on λ and can be dropped when maximizing)
Bernoulli (min. performance 0/1): p̂ = proportion of ones · Log-likelihood: Σᵢ [yᵢ·log(p) + (1−yᵢ)·log(1−p)]

Switch between families in Stage 3: the likelihood landscape changes its shape (a roughly parabolic arc for Bernoulli, a visibly asymmetric curve for Poisson), but the mechanism is identical: find the peak.
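The "find the peak" mechanism can be checked for both families with a simple grid search. The counts and 0/1 outcomes below are hypothetical, and the Poisson function includes the −log(yᵢ!) term (via `math.lgamma`), which is constant in λ and therefore does not move the peak.

```python
import math

def poisson_loglik(lam, counts):
    """Poisson log-likelihood, including the constant -log(y!) term."""
    return sum(y * math.log(lam) - lam - math.lgamma(y + 1) for y in counts)

def bernoulli_loglik(p, outcomes):
    """Bernoulli log-likelihood for 0/1 outcomes."""
    return sum(y * math.log(p) + (1 - y) * math.log(1.0 - p) for y in outcomes)

hours = [3, 5, 2, 4, 6]            # hypothetical training hours per week
passed = [1, 0, 1, 1, 0, 1, 1, 0]  # hypothetical 0/1 performance outcomes

lam_grid = [i / 100 for i in range(1, 1001)]  # lambda from 0.01 to 10.00
p_grid = [i / 1000 for i in range(1, 1000)]   # p from 0.001 to 0.999

lam_hat = max(lam_grid, key=lambda lam: poisson_loglik(lam, hours))
p_hat = max(p_grid, key=lambda p: bernoulli_loglik(p, passed))

print(f"Poisson:   grid peak = {lam_hat}, sample mean = {sum(hours) / len(hours)}")
print(f"Bernoulli: grid peak = {p_hat}, proportion of ones = {sum(passed) / len(passed)}")
```

In both cases the numerical peak coincides with the closed-form estimator, the sample mean for Poisson and the proportion of ones for Bernoulli.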

Model comparison with AIC & BIC — Stage 3

The maximized log-likelihood value ℓ̂ can be used directly for model comparison:

AIC = −2·ℓ̂ + 2k (k = number of parameters)
BIC = −2·ℓ̂ + k·log(n)

Smaller AIC/BIC = better trade-off between fit and complexity. BIC penalizes each parameter more strongly than AIC as soon as log(n) > 2, i.e. for n ≥ 8, so with large n BIC favors more parsimonious models.
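The two formulas translate directly into code. The sketch below compares two candidate models on the same hypothetical sample: a Normal with σ fixed at 1 (k = 1) and a Normal with σ estimated (k = 2). The data and the choice of candidates are assumptions made for illustration only.

```python
import math

def aic(loglik, k):
    """Akaike information criterion from a maximized log-likelihood."""
    return -2.0 * loglik + 2.0 * k

def bic(loglik, k, n):
    """Bayesian information criterion from a maximized log-likelihood."""
    return -2.0 * loglik + k * math.log(n)

def normal_loglik(data, mu, sigma):
    n = len(data)
    sum_sq = sum((yi - mu) ** 2 for yi in data)
    return -n * math.log(sigma) - 0.5 * n * math.log(2.0 * math.pi) - sum_sq / (2.0 * sigma ** 2)

data = [1.2, 0.7, 2.1, 1.6, 0.9, 1.4, 1.1, 1.8, 0.5, 1.7]  # hypothetical sample
n = len(data)
mu_hat = sum(data) / n
sigma_hat = math.sqrt(sum((yi - mu_hat) ** 2 for yi in data) / n)

# Model A: sigma fixed at 1, only mu estimated (k = 1).
loglik_a = normal_loglik(data, mu_hat, 1.0)
# Model B: mu and sigma both estimated (k = 2).
loglik_b = normal_loglik(data, mu_hat, sigma_hat)

aic_a, bic_a = aic(loglik_a, 1), bic(loglik_a, 1, n)
aic_b, bic_b = aic(loglik_b, 2), bic(loglik_b, 2, n)

print(f"Model A: loglik = {loglik_a:.3f}, AIC = {aic_a:.3f}, BIC = {bic_a:.3f}")
print(f"Model B: loglik = {loglik_b:.3f}, AIC = {aic_b:.3f}, BIC = {bic_b:.3f}")
```

Both criteria are only comparable between models fitted to the same data; the absolute values carry no meaning on their own.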

Application example: Does Poisson or Negative-Binomial fit the training-hours data better? Same data, different families — AIC/BIC decide. In Stage 3, AIC and BIC are computed live.