Maximum Likelihood
Likelihood · Log-Likelihood · MLE · Normal · Poisson · Bernoulli
© Dr. Rainer Düsing · Interactive Tools by Claude
Maximum Likelihood: which parameter makes the observed data most plausible?  ·  Likelihood ≠ Probability
Stage 1
Step 1: Place the data point y (σ = 1, fixed)
Step 2: Slide the distribution and read off the likelihood
[Interactive panel. Readouts: data point y (observed, fixed), distribution center μ, density value = likelihood L(μ|y), log-likelihood ℓ(μ|y). Likelihood curve: how high is f(y|μ) for each μ? Its maximum lies at μ = y.]
The right curve shows, for every possible μ, how high the density function is at the data point y. The maximum is exactly at μ = y. That is the MLE for a single data point.
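The Stage 1 idea can be sketched in a few lines of Python. This is a minimal illustration, not the tool's own code: the helper `normal_pdf`, the grid range, and the value y = 1.5 are assumptions chosen for the example. The data point stays fixed while μ is varied, and the μ with the highest density value is picked.

```python
import math

def normal_pdf(y, mu, sigma=1.0):
    """Density of Normal(mu, sigma^2) evaluated at y."""
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

y_obs = 1.5  # observed data point (illustrative value)

# Candidate values for mu from -4.00 to 4.00 in steps of 0.01.
mu_grid = [i / 100 for i in range(-400, 401)]

# Likelihood curve: the data point is fixed, mu varies.
likelihood = [normal_pdf(y_obs, mu) for mu in mu_grid]

# Index of the highest point on the likelihood curve.
best_index = max(range(len(mu_grid)), key=lambda i: likelihood[i])
mu_hat = mu_grid[best_index]

print(f"mu_hat = {mu_hat:.2f}, L(mu_hat | y) = {likelihood[best_index]:.4f}")
```

The grid search is only there to mirror the slider; for one data point and fixed σ the peak is known analytically to sit at μ = y.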
σ = 1.0 (fixed)
① Data point y: 1.5
② Slide distribution: μ = −2.00
Likelihood ≠ Probability: the central distinction
P(y | θ): Probability of data y when parameter θ is known. θ fixed, y unknown.

L(θ | y): Likelihood of parameter θ when data y are observed. y fixed, θ unknown.

Same mathematical formula, completely different interpretation.

Likelihood is not a probability distribution over θ: it need not sum or integrate to 1. For continuous distributions L is a density value, which can exceed 1 (set σ very small in Stage 2 to see the density curve rise above 1).
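Both claims can be checked numerically. The sketch below is an illustration under assumed values: σ = 0.1 is an arbitrary "very small" spread, and the single Bernoulli success is a hypothetical observation. It shows a density value above 1 and a likelihood function whose area over the parameter is 0.5 rather than 1.

```python
import math

def normal_pdf(y, mu, sigma):
    """Density of Normal(mu, sigma^2) evaluated at y."""
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

# Density at the center of the curve for a wide and a narrow distribution.
peak_sigma_1 = normal_pdf(0.0, 0.0, 1.0)      # below 1
peak_sigma_small = normal_pdf(0.0, 0.0, 0.1)  # above 1: a density, not a probability

# Bernoulli likelihood for one observed success (y = 1): L(p | y) = p.
# Midpoint rule for the area under L(p | y) over p in [0, 1].
steps = 1000
width = 1.0 / steps
area = sum(((i + 0.5) * width) * width for i in range(steps))

print(f"peak density, sigma = 1.0: {peak_sigma_1:.3f}")
print(f"peak density, sigma = 0.1: {peak_sigma_small:.3f}")
print(f"area under L(p | y = 1):   {area:.3f}")
```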
Why Log-Likelihood? (Stage 1 & 2)
The likelihood of n independent observations is the product of the individual densities: L(θ|y₁…yₙ) = ∏ᵢ f(yᵢ|θ)

With many small values (e.g. 0.04 × 0.09 × … × 0.06) the product becomes extremely small and eventually causes numerical underflow. Stage 2 shows this live: the product of the red lines is displayed in the info text and shrinks rapidly.

The logarithm converts the product into a sum: ℓ(θ) = Σᵢ log f(yᵢ|θ)

Why does this work? The logarithm is a monotonically increasing function: whenever L(θ) gets larger, log L(θ) also gets larger, and wherever L(θ) reaches its maximum, log L(θ) reaches its maximum at exactly the same point θ̂. It does not matter whether you maximize L(θ) or ℓ(θ); the answer is always identical.

The key practical advantage: instead of multiplying thousands of tiny numbers together (which causes numerical underflow, so the computer simply returns 0), you add negative numbers of manageable size. The sum stays well within floating-point range, so the computation remains numerically stable.
In practice, MLE therefore maximizes ℓ(θ) = log L(θ).
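The underflow problem is easy to reproduce. The following sketch uses made-up data (2000 values cycling through −3…3) and a standard Normal model; both are assumptions for illustration only. The raw product collapses to zero in double precision, while the sum of log-densities stays finite.

```python
import math

def normal_pdf(y, mu=0.0, sigma=1.0):
    """Density of Normal(mu, sigma^2) evaluated at y."""
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def normal_log_pdf(y, mu=0.0, sigma=1.0):
    """Log-density of Normal(mu, sigma^2), computed without exponentiating."""
    z = (y - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)

# Hypothetical data set: 2000 observations cycling through -3, -2, ..., 3.
data = [float((i % 7) - 3) for i in range(2000)]

# Likelihood as a raw product of densities: shrinks until it underflows.
product = 1.0
for y in data:
    product *= normal_pdf(y)

# Log-likelihood as a sum of log-densities: a large negative but finite number.
log_lik = sum(normal_log_pdf(y) for y in data)

print("product of densities:", product)
print("sum of log-densities:", log_lik)
```

Because the logarithm is monotone, any optimizer applied to `log_lik` as a function of the parameters finds the same maximizer the product would have, had it been representable.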
MLE: the principle & estimators (Stage 1 & 2)
Maximum Likelihood Estimation finds the θ that makes the observed data most plausible:

θ̂ = argmax_θ ℓ(θ|y)

This is not a verdict about θ itself, only about its compatibility with the data. Other θ values are not impossible, just less plausible.

Analytical MLE estimators (Normal):
μ̂ = ȳ  ·  σ̂² = Σ(yᵢ−ȳ)²/n  (biased: n, not n−1!)

In Stage 1: μ̂ = y (one point); in Stage 2: μ̂ = ȳ (all points).
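As a small worked example of the closed-form estimators, the sketch below computes μ̂ and σ̂² for a hypothetical sample and contrasts the MLE variance (divisor n) with the unbiased sample variance (divisor n − 1). The data values are invented; `statistics.pvariance` and `statistics.variance` from the Python standard library serve as a cross-check.

```python
import statistics

# Hypothetical sample, for illustration only.
data = [2.1, 3.4, 1.9, 4.2, 2.8]
n = len(data)

# MLE of the mean: the sample mean.
mu_hat = sum(data) / n

# Sum of squared deviations from the estimated mean.
ss = sum((y - mu_hat) ** 2 for y in data)

sigma2_mle = ss / n             # MLE: divides by n (biased downward)
sigma2_unbiased = ss / (n - 1)  # usual sample variance: divides by n - 1

print(f"mu_hat          = {mu_hat:.4f}")
print(f"sigma^2 (MLE)   = {sigma2_mle:.4f}")
print(f"sigma^2 (n - 1) = {sigma2_unbiased:.4f}")

# Cross-check against the standard library's population / sample variance.
print(statistics.pvariance(data), statistics.variance(data))
```

The two variance estimates differ by the factor (n − 1)/n, so the gap vanishes as n grows.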
Multidimensional MLE: μ and σ simultaneously (Stage 2)
MLE can estimate several parameters simultaneously by maximizing a single joint objective.

For the Normal distribution there are two unknown parameters: μ (location) and σ (spread). The log-likelihood becomes a surface over the (μ, σ)-space, a mountain landscape instead of a curve. The joint MLE lies at the peak:

μ̂ = ȳ    σ̂ = √(Σ(yᵢ−ȳ)²/n)

In Stage 2: activate "Vary σ too" → the heatmap shows the 2D landscape. The bright spot is the joint peak. This principle scales to arbitrarily many parameters: glm() in R maximizes a likelihood numerically (via iteratively reweighted least squares), and the least-squares coefficients from lm() coincide with the MLE under normally distributed errors.
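The heatmap can be imitated with a brute-force grid search over (μ, σ). This is a sketch under assumptions: the data, the grid ranges, and the step size of 0.01 are all invented, and real software uses numerical optimizers rather than an exhaustive grid. The peak of the grid should land within one grid step of the analytical estimators.

```python
import math

def normal_log_lik(data, mu, sigma):
    """Log-likelihood of i.i.d. Normal(mu, sigma^2) data."""
    n = len(data)
    ss = sum((y - mu) ** 2 for y in data)
    return -n * math.log(sigma) - 0.5 * n * math.log(2.0 * math.pi) - ss / (2.0 * sigma ** 2)

# Hypothetical sample, for illustration only.
data = [2.1, 3.4, 1.9, 4.2, 2.8, 3.1, 2.5, 3.7]
n = len(data)

# Grid over the (mu, sigma) plane, step 0.01 in both directions.
mu_grid = [i / 100 for i in range(100, 501)]    # 1.00 ... 5.00
sigma_grid = [i / 100 for i in range(20, 201)]  # 0.20 ... 2.00

# Highest point of the log-likelihood surface on the grid.
mu_peak, sigma_peak = max(
    ((mu, sigma) for mu in mu_grid for sigma in sigma_grid),
    key=lambda pair: normal_log_lik(data, pair[0], pair[1]),
)

# Closed-form joint MLE for comparison.
mu_hat = sum(data) / n
sigma_hat = math.sqrt(sum((y - mu_hat) ** 2 for y in data) / n)

print(f"grid peak:  mu = {mu_peak:.2f}, sigma = {sigma_peak:.2f}")
print(f"analytical: mu = {mu_hat:.4f}, sigma = {sigma_hat:.4f}")
```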
MLE is general: Poisson & Bernoulli (Stage 3)
MLE works for any parametric family, not just the Normal. This is the foundation of GLMs:

Poisson (training hours/week): λ̂ = ȳ  ·  Log-likelihood: Σᵢ [yᵢ·log(λ) − λ]  (up to the constant −Σᵢ log(yᵢ!), which does not depend on λ)
Bernoulli (min. performance 0/1): p̂ = proportion of ones  ·  Log-likelihood: Σᵢ [yᵢ·log(p) + (1−yᵢ)·log(1−p)]

Switch between families in Stage 3: the likelihood landscape changes its shape (a roughly parabolic, single-peaked curve for Bernoulli, an asymmetric one for Poisson), but the mechanism is identical: find the peak.
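The same "find the peak" mechanism can be sketched for both families. The counts and 0/1 outcomes below are invented for illustration, and the grid search stands in for a proper optimizer. In both cases the grid maximum coincides with the closed-form estimator (sample mean for Poisson, proportion of ones for Bernoulli).

```python
import math

def poisson_log_lik(data, lam):
    """Poisson log-likelihood, including the constant term -log(y!)."""
    return sum(y * math.log(lam) - lam - math.lgamma(y + 1) for y in data)

def bernoulli_log_lik(data, p):
    """Bernoulli log-likelihood for 0/1 outcomes."""
    return sum(y * math.log(p) + (1 - y) * math.log(1.0 - p) for y in data)

# Hypothetical data, for illustration only.
hours = [3, 5, 2, 4, 6, 3, 5, 4]          # counts, sample mean 4.0
passed = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]   # 0/1 outcomes, 7 ones out of 10

# Parameter grids with step 0.01 (boundaries excluded to avoid log(0)).
lam_grid = [i / 100 for i in range(1, 1001)]  # 0.01 ... 10.00
p_grid = [i / 100 for i in range(1, 100)]     # 0.01 ... 0.99

lam_hat = max(lam_grid, key=lambda lam: poisson_log_lik(hours, lam))
p_hat = max(p_grid, key=lambda p: bernoulli_log_lik(passed, p))

print(f"Poisson:   grid peak = {lam_hat:.2f}, sample mean = {sum(hours) / len(hours):.2f}")
print(f"Bernoulli: grid peak = {p_hat:.2f}, proportion  = {sum(passed) / len(passed):.2f}")
```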
Model comparison with AIC & BIC (Stage 3)
The maximized log-likelihood value ℓ̂ can be used directly for model comparison:

AIC = −2·ℓ̂ + 2k  (k = number of parameters)
BIC = −2·ℓ̂ + k·log(n)

Smaller AIC/BIC = better trade-off between fit and complexity; at equal complexity, the better-fitting model wins. BIC penalizes each parameter more strongly than AIC as soon as log(n) > 2 (n ≥ 8), so with large n, BIC favors more parsimonious models.

Application example: Does Poisson or Negative-Binomial fit the training-hours data better? Same data, different families; AIC/BIC decide. In Stage 3, AIC and BIC are computed live.
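A minimal sketch of such a comparison follows. To keep it closed-form, it swaps the Negative-Binomial for the geometric distribution on {0, 1, 2, …}, which is its special case with one success and whose MLE is p̂ = 1/(1 + ȳ); this substitution, the helper names `aic` and `bic`, and the data are assumptions for illustration. Note that comparing different families requires the full log-likelihood, including terms such as −log(y!) that are constant within one family.

```python
import math

def aic(log_lik, k):
    """Akaike information criterion."""
    return -2.0 * log_lik + 2.0 * k

def bic(log_lik, k, n):
    """Bayesian information criterion."""
    return -2.0 * log_lik + k * math.log(n)

# Hypothetical count data (training hours per week), for illustration only.
hours = [3, 5, 2, 4, 6, 3, 5, 4]
n = len(hours)
y_bar = sum(hours) / n

# Model A: Poisson, one parameter, MLE lambda_hat = sample mean.
lam_hat = y_bar
ll_poisson = sum(y * math.log(lam_hat) - lam_hat - math.lgamma(y + 1) for y in hours)

# Model B: geometric on {0, 1, 2, ...}, P(y) = p * (1 - p)^y,
# one parameter, MLE p_hat = 1 / (1 + sample mean).
p_hat = 1.0 / (1.0 + y_bar)
ll_geometric = sum(math.log(p_hat) + y * math.log(1.0 - p_hat) for y in hours)

for name, ll in [("Poisson", ll_poisson), ("Geometric", ll_geometric)]:
    print(f"{name:10s} logLik = {ll:8.3f}  AIC = {aic(ll, 1):7.3f}  BIC = {bic(ll, 1, n):7.3f}")
```

With both models having k = 1, the penalty terms cancel and the ranking is decided by ℓ̂ alone; the penalties matter once the candidates differ in the number of parameters.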
ℹ Maximum Likelihood: Help
What will I learn here?
This tool explains Maximum Likelihood Estimation (MLE), one of the most important estimation mechanisms in statistics. And it makes a crucial distinction clear: likelihood is not probability.
The three stages
Likelihood ≠ Probability
Probability: Parameters fixed → how probable are these data?
Likelihood: Data fixed, observed → how plausible is this parameter?

Likelihood is not a probability distribution over the parameter: it does not integrate to 1. Only combining it with a prior yields a posterior (Bayes' theorem).
Why this matters for Bayes
MLE delivers the most plausible parameter without prior information. Bayesian estimation weights this likelihood with a prior: Posterior ∝ Likelihood × Prior. With a flat prior → posterior mode ≈ MLE. This makes MLE the conceptual foundation for everything that follows.
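The relation Posterior ∝ Likelihood × Prior can be sketched on a grid for the Bernoulli case. Everything specific here is an assumption for illustration: the 0/1 data, the grid, and the informative Beta(2, 8) prior. Working on the log scale, the unnormalized log-posterior is the log-likelihood plus the log-prior; with a flat prior the posterior mode equals the MLE, while the informative prior pulls the mode away from it.

```python
import math

def bernoulli_log_lik(data, p):
    """Bernoulli log-likelihood for 0/1 outcomes."""
    return sum(y * math.log(p) + (1 - y) * math.log(1.0 - p) for y in data)

def log_beta_prior(p, a, b):
    """Log of a Beta(a, b) prior density, up to an additive constant."""
    return (a - 1.0) * math.log(p) + (b - 1.0) * math.log(1.0 - p)

# Hypothetical 0/1 data: 7 ones out of 10.
passed = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]
p_grid = [i / 100 for i in range(1, 100)]  # 0.01 ... 0.99

# MLE: peak of the log-likelihood alone.
p_mle = max(p_grid, key=lambda p: bernoulli_log_lik(passed, p))

# Flat prior Beta(1, 1): the log-prior is constant, so the peak does not move.
mode_flat = max(p_grid, key=lambda p: bernoulli_log_lik(passed, p) + log_beta_prior(p, 1.0, 1.0))

# Informative prior Beta(2, 8) (an assumed example): favors small p.
mode_informative = max(p_grid, key=lambda p: bernoulli_log_lik(passed, p) + log_beta_prior(p, 2.0, 8.0))

print(f"MLE                       = {p_mle:.2f}")
print(f"posterior mode, flat      = {mode_flat:.2f}")
print(f"posterior mode, Beta(2,8) = {mode_informative:.2f}")
```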
Next → From LM to GLM: link functions and why GLMs apply the same MLE logic to other distributions