Maximum Likelihood: which parameter makes the observed data most plausible?
Likelihood ≠ Probability
Stage 1 (interactive panel; σ = 1, fixed)
Step 1: Place the data point y (observed, fixed).
Step 2: Slide the distribution centered at μ and read off the density value at y; this value is the likelihood L(μ|y).
Likelihood curve: how high is f(y|μ) for each μ? The maximum is at μ = y. The panel also reports the log-likelihood ℓ(μ|y).
The right curve shows, for every possible μ, how high the density function
would be at data point y. The maximum is exactly at μ = y.
That is the MLE for one data point.
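The single-point case can be reproduced in a few lines. This is a minimal sketch, not the page's own implementation: the observed value y_obs = 2.5, the grid of candidate μ values, and the helper normal_pdf are all assumptions made for illustration.

```python
import math

def normal_pdf(y, mu, sigma=1.0):
    """Density of Normal(mu, sigma) evaluated at y."""
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

y_obs = 2.5  # hypothetical observed data point (fixed)

# Candidate values for mu: 0.00, 0.01, ..., 5.00
mu_grid = [i / 100 for i in range(501)]

# Likelihood curve: density at y_obs for every candidate mu (sigma = 1 fixed)
likelihood = [normal_pdf(y_obs, mu) for mu in mu_grid]

# The MLE is the grid value where the curve is highest
best_index = max(range(len(mu_grid)), key=lambda i: likelihood[i])
mu_hat = mu_grid[best_index]
print(f"mu_hat = {mu_hat}, L(mu_hat | y) = {likelihood[best_index]:.4f}")
```

The data point never moves; only μ is varied, which is exactly what sliding the distribution does in Stage 1.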
Likelihood ≠ Probability: the central distinction
P(y | θ): Probability of data y when parameter θ is known. θ fixed, y unknown.
L(θ | y): Likelihood of parameter θ when data y are observed. y fixed, θ unknown.
Same mathematical formula, completely different interpretation.
The likelihood does not sum (or integrate) to 1 over θ and is not a probability distribution over θ. For continuous distributions L is the density value, which can exceed 1 (set σ very small in Stage 2 to see the density curve rise above 1).
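Both claims can be checked numerically. The sketch below is illustrative only: the values σ = 0.1 and p = 0.3 and the helper names are assumptions. It shows a Normal density above 1, and a Bernoulli likelihood whose area over p is 0.5 rather than 1.

```python
import math

def normal_pdf(y, mu, sigma):
    """Density of Normal(mu, sigma) evaluated at y."""
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def bernoulli_pmf(y, p):
    """P(y | p) for y in {0, 1}."""
    return p if y == 1 else 1.0 - p

# 1) A density value (and hence a likelihood value) can exceed 1
peak_density = normal_pdf(0.0, mu=0.0, sigma=0.1)

# 2) As a function of y (p fixed), the probabilities sum to 1
prob_sum = bernoulli_pmf(0, 0.3) + bernoulli_pmf(1, 0.3)

# 3) As a function of p (y = 1 fixed), L(p | y) = p; its area over [0, 1]
#    is 0.5, so it is not a probability distribution over p
n_steps = 1000
area = sum(bernoulli_pmf(1, (i + 0.5) / n_steps) for i in range(n_steps)) / n_steps

print(peak_density, prob_sum, area)
```

The same function is evaluated in steps 2 and 3; only the argument that is held fixed changes.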
Why Log-Likelihood? (Stage 1 & 2)
The likelihood of n observations is the product of individual densities:
L(θ | y₁, …, yₙ) = ∏ᵢ f(yᵢ | θ)
With many small values (e.g. 0.04 × 0.09 × … × 0.06) the product becomes extremely small and eventually causes numerical underflow. Stage 2 shows this live: the product of the red lines is displayed in the info text and shrinks rapidly.
The logarithm converts the product into a sum:
ℓ(θ) = log L(θ) = Σᵢ log f(yᵢ | θ)
Why does this work? The logarithm is a monotonically increasing function: whenever L(θ) gets larger, log L(θ) also gets larger, and wherever L(θ) reaches its maximum, log L(θ) reaches its maximum at exactly the same point θ̂. It does not matter whether you maximize L(θ) or ℓ(θ); the answer is always identical.
The key practical advantage: instead of multiplying thousands of tiny numbers together (which causes numerical underflow, so the computer simply returns 0), you add manageable negative numbers, and the sum stays well within floating-point range.
In practice, MLE therefore maximizes ℓ(θ) = log L(θ).
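The underflow is easy to trigger. This sketch reuses the three example density values from the text and repeats them to mimic n = 1200 observations; the repetition count is an arbitrary assumption.

```python
import math

# Hypothetical density values, repeated to mimic a sample of n = 1200
densities = [0.04, 0.09, 0.06] * 400

# Direct product: far below the smallest double, so it underflows to 0.0
raw_product = math.prod(densities)

# Sum of logs: an ordinary negative number
log_likelihood = sum(math.log(d) for d in densities)

print(raw_product)      # 0.0
print(log_likelihood)   # roughly -3376
```

Once the product has collapsed to 0.0, every candidate θ looks equally bad, so no maximum can be located. The log-likelihood keeps the ordering of candidates intact.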
MLE: the principle & estimators (Stage 1 & 2)
Maximum Likelihood Estimation finds the θ that makes the observed data
most plausible:
θ̂ = argmax_θ ℓ(θ | y)
This is not a verdict about θ itself, only about its compatibility with the data. Other θ values are not impossible, just less plausible.
Analytical MLE estimators (Normal):
μ̂ = ȳ · σ̂² = Σ(yᵢ − ȳ)²/n (biased: divides by n, not n−1!)
In Stage 1: μ̂ = y (one point); in Stage 2: μ̂ = ȳ (all points).
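The closed-form estimators, and the difference between dividing by n and by n−1, can be sketched as follows. The sample values are hypothetical; statistics.pvariance and statistics.variance from the Python standard library serve only as a cross-check.

```python
import statistics

y = [4.2, 5.1, 3.8, 6.0, 5.5, 4.9]  # hypothetical sample
n = len(y)

# MLE of the mean: the sample average
mu_hat = sum(y) / n

# Sum of squared deviations from the estimated mean
ss = sum((yi - mu_hat) ** 2 for yi in y)

sigma2_mle = ss / n             # MLE: divides by n (biased downward)
sigma2_unbiased = ss / (n - 1)  # usual sample variance: divides by n - 1

# Cross-check against the standard library
print(sigma2_mle, statistics.pvariance(y))
print(sigma2_unbiased, statistics.variance(y))
```

The two variance estimates differ by the factor n/(n−1), so the gap closes as the sample grows.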
Multidimensional MLE: μ and σ simultaneously (Stage 2)
MLE can estimate multiple parameters simultaneously, a key advantage
over simple moment estimators.
For the Normal distribution there are two unknown parameters: μ (location) and σ (spread). The log-likelihood becomes a surface over the (μ, σ)-space, a mountain landscape instead of a curve. The joint MLE lies at the peak:
μ̂ = ȳ · σ̂ = √(Σ(yᵢ − ȳ)²/n)
In Stage 2: activate "Vary σ too" and the heatmap shows the 2D landscape. The bright spot is the joint peak. This principle scales to arbitrarily many parameters. It is the principle behind glm() in R, and the least-squares fit of lm() coincides with the MLE of the coefficients under Normal errors.
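The two-parameter landscape can be approximated with a brute-force grid, which is roughly what a heatmap displays. This is a sketch under assumptions: the sample, the grid ranges, and the 0.01 step are chosen for illustration, and a real optimizer would not scan a grid.

```python
import math

def normal_loglik(y, mu, sigma):
    """Log-likelihood of an i.i.d. Normal(mu, sigma) sample."""
    n = len(y)
    ss = sum((yi - mu) ** 2 for yi in y)
    return (-n * math.log(sigma)
            - 0.5 * n * math.log(2.0 * math.pi)
            - ss / (2.0 * sigma ** 2))

y = [4.2, 5.1, 3.8, 6.0, 5.5, 4.9]  # hypothetical sample

mu_grid = [3.0 + 0.01 * i for i in range(401)]     # 3.00 ... 7.00
sigma_grid = [0.2 + 0.01 * j for j in range(181)]  # 0.20 ... 2.00

# Evaluate the surface on the grid and keep the highest point
_, mu_peak, sigma_peak = max(
    (normal_loglik(y, m, s), m, s) for m in mu_grid for s in sigma_grid
)

# Closed-form joint MLE for comparison
mu_hat = sum(y) / len(y)
sigma_hat = math.sqrt(sum((yi - mu_hat) ** 2 for yi in y) / len(y))

print(f"grid peak:   mu = {mu_peak:.2f}, sigma = {sigma_peak:.2f}")
print(f"closed form: mu = {mu_hat:.4f}, sigma = {sigma_hat:.4f}")
```

The grid peak agrees with the closed-form solution up to the grid resolution.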
MLE is general: Poisson & Bernoulli (Stage 3)
MLE works for any parametric family, not just the Normal.
This is the foundation of GLMs:
Poisson (training hours/week): λ̂ = ȳ · Log-likelihood: Σᵢ [yᵢ·log(λ) − λ] (up to the constant −Σᵢ log(yᵢ!), which does not depend on λ)
Bernoulli (min. performance 0/1): p̂ = proportion of ones · Log-likelihood: Σᵢ [yᵢ·log(p) + (1−yᵢ)·log(1−p)]
Switch between families in Stage 3: the likelihood landscape changes its shape (roughly parabolic for Bernoulli, asymmetric for Poisson), but the mechanism is identical: find the peak.
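The two log-likelihoods above translate directly into code. In this sketch the data sets (hours, passed) and the grids are hypothetical. The Poisson function implements only the λ-dependent kernel given in the text, which is enough to locate the peak.

```python
import math

def poisson_loglik(y, lam):
    """Poisson log-likelihood kernel; the constant -sum(log(y_i!)) is omitted."""
    return sum(yi * math.log(lam) - lam for yi in y)

def bernoulli_loglik(y, p):
    """Bernoulli log-likelihood for 0/1 data."""
    return sum(yi * math.log(p) + (1 - yi) * math.log(1.0 - p) for yi in y)

hours = [3, 5, 2, 4, 6, 4]          # hypothetical counts, mean = 4
passed = [1, 0, 1, 1, 0, 1, 1, 0]   # hypothetical 0/1 outcomes, 5 of 8 ones

lam_grid = [0.01 * i for i in range(1, 1001)]  # 0.01 ... 10.00
p_grid = [0.001 * i for i in range(1, 1000)]   # 0.001 ... 0.999

# Same mechanism for both families: find the peak of the log-likelihood
lam_peak = max(lam_grid, key=lambda lam: poisson_loglik(hours, lam))
p_peak = max(p_grid, key=lambda p: bernoulli_loglik(passed, p))

print(f"lambda: grid peak {lam_peak:.2f}, sample mean {sum(hours) / len(hours):.2f}")
print(f"p:      grid peak {p_peak:.3f}, proportion {sum(passed) / len(passed):.3f}")
```

Only the log-likelihood function changes between families; the search for the peak stays the same.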
Model comparison with AIC & BIC (Stage 3)
The maximized log-likelihood value ℓ̂ can be used directly for model comparison:
AIC = −2·ℓ̂ + 2k (k = number of parameters)
BIC = −2·ℓ̂ + k·log(n)
Smaller AIC/BIC is better: both reward fit (larger ℓ̂) and penalize complexity (larger k). AIC penalizes less strongly than BIC whenever log(n) > 2, i.e. for n ≥ 8; with large n, BIC therefore favors more parsimonious models.
Application example: Does Poisson or Negative-Binomial fit the training-hours data better? Same data, different families: AIC/BIC decide. In Stage 3, AIC and BIC are computed live.
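A Poisson versus Negative-Binomial comparison might look like the sketch below. Several things here are assumptions rather than the page's method: the overdispersed counts are made up, the Negative-Binomial is parameterized by mean μ and dispersion r, and r is found by a coarse grid search instead of a proper optimizer. Because the two families are compared against each other, the full log-likelihoods are used, including the −log(y!) terms.

```python
import math

def poisson_loglik(y, lam):
    """Full Poisson log-likelihood, including the -log(y!) term."""
    return sum(yi * math.log(lam) - lam - math.lgamma(yi + 1) for yi in y)

def negbin_loglik(y, mu, r):
    """Full Negative-Binomial log-likelihood (mean mu, dispersion r)."""
    log_p = math.log(r / (r + mu))
    log_q = math.log(mu / (r + mu))
    return sum(
        math.lgamma(yi + r) - math.lgamma(r) - math.lgamma(yi + 1)
        + r * log_p + yi * log_q
        for yi in y
    )

def aic(loglik, k):
    return -2.0 * loglik + 2.0 * k

def bic(loglik, k, n):
    return -2.0 * loglik + k * math.log(n)

y = [0, 1, 1, 2, 2, 3, 3, 4, 5, 7, 9, 12]  # hypothetical overdispersed counts
n = len(y)
y_bar = sum(y) / n

# Poisson: one parameter, MLE lambda = sample mean
ll_pois = poisson_loglik(y, y_bar)

# Negative-Binomial: two parameters; for fixed r the MLE of mu is the sample
# mean, so only r is searched (coarse grid 0.1 ... 50.0, an assumption)
r_grid = [0.1 * i for i in range(1, 501)]
r_hat = max(r_grid, key=lambda r: negbin_loglik(y, y_bar, r))
ll_nb = negbin_loglik(y, y_bar, r_hat)

aic_pois, bic_pois = aic(ll_pois, 1), bic(ll_pois, 1, n)
aic_nb, bic_nb = aic(ll_nb, 2), bic(ll_nb, 2, n)

print(f"Poisson: logLik = {ll_pois:.2f}, AIC = {aic_pois:.2f}, BIC = {bic_pois:.2f}")
print(f"NegBin:  logLik = {ll_nb:.2f}, AIC = {aic_nb:.2f}, BIC = {bic_nb:.2f}")
```

For these made-up counts the variance is much larger than the mean, so the Negative-Binomial gains enough log-likelihood to outweigh the penalty for its extra parameter.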