Model estimation by maximum likelihood

John M. Drake & Pej Rohani

Model calibration

Model calibration requires tuning the parameters to “best” represent reality

Evaluation of this representation is by

  • goodness-of-fit
  • generalizability

Least squares estimation

Model calibration is performed by adjusting the parameters to minimize some objective function (aka loss function)

For instance, least squares estimation minimizes the sum of squared errors

\[ SSE = \sum_{i=1}^n (y_i - f(x_i))^2 \]

where \( y_i \) is the \( i^{th} \) observation, \( x_i \) are the inputs to the model, and \( f() \) is the model.

Least squares estimation

The “adjustment” of parameters is typically performed by numerical optimization

Least squares estimation: example

# The model
f <- function(a, x) x^a

x0 <- seq(0, 100)   # inputs
a0 <- 2.2           # true parameter value
y0 <- f(a0, x0)     # true response

plot(x0, y0, type='l', xlab='x', ylab='f(x)', main='The truth', ylim=c(-5000,30000))

[Figure ('The truth'): the model f(x) plotted against x]

Least squares estimation

With some data

y1 <- y0 + rnorm(101, sd=3000)   # noisy observations of the true model
data <- data.frame(x0, y1)
plot(x0, y0, type='l', xlab='x', ylab='f(x)', main='Model and data', ylim=c(-5000,30000))
points(data)
lines(c(x0[10], x0[10]), c(y1[10], y0[10]))   # one residual shown as a vertical segment

[Figure ('Model and data'): the noisy observations with the true curve, and one residual marked by a vertical segment]

Least squares estimation

The objective function

obj <- function(parms0, data){
  y <- f(parms0, data$x0)        # model prediction
  sse <- sum((data$y1 - y)^2)    # sum of squared errors
  return(sse)
}

guess1 <- 1.8
guess2 <- 2.1
guess3 <- 2.4

sse1 <- obj(guess1,data)
sse2 <- obj(guess2,data)
sse3 <- obj(guess3,data)


[Figure: the model curves for the three guesses of a, plotted with the data]

Least squares estimation

With a single parameter, the minimum is easy to locate by plotting the objective function over a range of parameter values

[Figure: the objective function (SSE) plotted against the parameter a]

Least squares estimation

In general (and certainly with multiple parameters) we find the minimum by numerical optimization, e.g. with R's optim

# Brent's method is for one-dimensional problems and requires finite bounds
p0 <- 1.8
fit <- optim(p0, obj, data=data, method='Brent', lower=1.6, upper=2.5)
print(fit)
$par
[1] 2.19926

$value
[1] 944984983

$counts
function gradient 
      NA       NA 

$convergence
[1] 0

$message
NULL

Likelihood theory

What's so special about the sum of squared errors?

Likelihood theory substitutes an alternative objective function (the likelihood function), which is closely related to the probability density (or mass) function

Suppose we have a continuous probability density \( f(x|\theta) \), for instance the pdf of the Gaussian distribution:

\[ f(x|\mu, \sigma) = \frac{1}{\sqrt{2\pi} \sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}} \]

From the laws of probability, for \( n \) independent observations \( x_1, x_2, x_3, ... x_n \), the joint density is

\[ f(x_1, x_2, x_3, ... x_n | \mu, \sigma) = \prod_{i=1}^n f(x_i|\mu, \sigma) \]

Likelihood theory

The idea behind likelihood is to interpret this formula as a function of the parameters given the data, written

\[ \mathcal{L} (\mu, \sigma ; x_1, x_2, x_3, ... x_n) = \prod_{i=1}^n f(x_i|\mu, \sigma) \]

Note: Likelihoods are not probabilities. They don't integrate to one and they don't specify the relative probability of different parameter values.
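
For concreteness, here is a small numerical sketch (the observations are made up): R's dnorm evaluates the Gaussian density, and the likelihood of a candidate \( (\mu, \sigma) \) is the product of the individual densities at the data.

# Sketch: the likelihood of (mu, sigma) for some made-up observations is the
# product of the individual Gaussian densities evaluated at the data.
x <- c(4.2, 5.1, 3.8, 5.6, 4.9)
lik <- function(mu, sigma) prod(dnorm(x, mean = mu, sd = sigma))

lik(5, 1)       # likelihood at one candidate pair (mu, sigma)
lik(4.7, 0.7)   # a different candidate gives a different likelihood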

Likelihood theory

It is often convenient to transform likelihoods using the natural logarithm, in which case we work in the space of log-likelihood

\[ \ell (\theta ; \mathbf{y} ) = \ln \mathcal{L}(\theta; \mathbf{y}) \]

The idea of maximum likelihood is to find those values of the parameters that maximize the likelihood. Since the logarithm is a monotonic transform, the parameters that maximize the log-likelihood are the same as the parameters that maximize the likelihood.

Finally, many optimization algorithms are set up to minimize a function (as in minimizing the sum of squared errors); therefore, we often multiply by negative one and talk about minimizing the negative log-likelihood.
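
Continuing the sketch above (same made-up observations), the negative log-likelihood can be handed directly to optim; with log = TRUE the product becomes a sum of logs.

# Sketch: minimize the negative log-likelihood for (mu, sigma) with optim.
x <- c(4.2, 5.1, 3.8, 5.6, 4.9)
nll <- function(parms) {
  mu <- parms[1]
  sigma <- parms[2]
  if (sigma <= 0) return(Inf)                        # keep sigma positive
  -sum(dnorm(x, mean = mu, sd = sigma, log = TRUE))  # sum of log densities
}
fit <- optim(c(mu = 4, sigma = 1), nll)              # Nelder-Mead by default
fit$par   # MLEs: the sample mean and the n-denominator standard deviation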

Properties of maximum likelihood estimators

As sample size increases to infinity, sequences of maximum likelihood estimators are:

  • Consistent (MLEs converge in probability to the value being estimated, related to bias.)
  • Functionally invariant (If \( \hat \theta \) is the maximum likelihood estimator for \( \theta \), and if \( g(\theta) \) is any transformation of \( \theta \), then the maximum likelihood estimator for \( \alpha = g(\theta) \) is \( \hat \alpha = g(\hat \theta) \).)
  • Efficient (No consistent estimator has lower asymptotic mean squared error; note this is the error of the estimate)
  • Asymptotically normal (With large samples estimates will be normally distributed, which supports inference)

Note: There is a large literature on statistical regularization, which accepts some bias in exchange for reduced variance; this is particularly useful for estimation with small sample sizes. Such estimation is not yet widely used with epidemic models.

The Swiss army knife principle

Steve Ellner: When it can do the job (which is most of the time), maximum likelihood is rarely the best possible tool, but it's rarely much worse than the best.

A practical trick

Transformations can be used to naturally bound parameters and provide smoother gradients for optimization

  • For positive parameters (e.g. \( \beta \), \( \gamma \)) a log-transform is recommended, i.e. rewrite the model in terms of \( b=\ln \beta \implies \beta=e^b \) and \( g=\ln \gamma \implies \gamma = e^g \). Thus, for example, we might write \( dS/dt = -e^b SI \).
  • For fractions (\( 0<\phi<1 \)), such as a survival probability, rewrite the model in terms of the logit transform \( p = \ln \left(\frac{\phi}{1-\phi} \right) \) (see the sketch below)
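
To make the trick concrete, here is a minimal sketch (the names backtransform, b, and p are just illustrative): the optimizer works with unconstrained quantities, and the objective back-transforms them before evaluating the model.

# Sketch: optimize over unconstrained quantities and back-transform inside
# the objective. Here b = log(beta) and p = logit(phi).
backtransform <- function(parms) {
  beta <- exp(parms[["b"]])      # log scale guarantees beta > 0
  phi  <- plogis(parms[["p"]])   # inverse logit guarantees 0 < phi < 1
  c(beta = beta, phi = phi)
}
backtransform(c(b = log(0.5), p = qlogis(0.9)))   # recovers beta = 0.5 and phi = 0.9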

Relationship between likelihood and least squares

A special case: when the errors are normally distributed with constant variance, the least squares estimate and the maximum likelihood estimate coincide.

\[ f(x|\mu, \sigma) = \frac{1}{\sqrt{2\pi} \sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}} \]

However, this equivalence does not hold in general.
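
As a quick numerical check of the special case (a sketch reusing x0, y1, data, and the model f from the earlier chunks, with \( \sigma \) estimated on the log scale so it stays positive), maximizing the Gaussian likelihood gives essentially the same estimate of a as least squares:

# Sketch: Gaussian maximum likelihood for the power-law example.
nll <- function(parms, data) {
  a <- parms[1]
  sigma <- exp(parms[2])   # sigma on the log scale
  -sum(dnorm(data$y1, mean = f(a, data$x0), sd = sigma, log = TRUE))
}
fit.ml <- optim(c(a = 1.8, log.sigma = log(1000)), nll, data = data)
fit.ml$par["a"]   # essentially the same as the least squares estimate above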

Likelihood function for the binomial distribution

Imagine we have a binomial sample (e.g. \( X \) deaths in \( n \) cases). We may assume the cases are independent and that each has the same probability of death, the case fatality rate \( p \). Then, the probability of observing exactly \( X=k \) deaths is given by the probability mass function of the binomial distribution

\[ f(k,n,p) = Pr(X=k) = {n \choose k} p^k(1-p)^{n-k} \]

where

\[ {n \choose k } = \frac{n!}{k!(n-k)!} \]

is known as the binomial coefficient.
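
As a quick numerical check (with made-up numbers: 1 death in 5 cases and \( p = 0.2 \)), the formula agrees with R's dbinom:

# The binomial pmf by hand and via dbinom (1 death in 5 cases, p = 0.2)
choose(5, 1) * 0.2^1 * (1 - 0.2)^(5 - 1)   # 0.4096
dbinom(1, size = 5, prob = 0.2)            # 0.4096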

Likelihood function for the binomial distribution

The function \( f(k,n,p) \) is available in R using the function dbinom. Hence, for 5 cases and 1 death we have the following likelihood.

p <- seq(0, 1, 0.01)
f <- dbinom(1, 5, p)              # likelihood of p for 1 death in 5 cases
plot(p, f, type='h', xlab='p', ylab='likelihood')
abline(v=0.2, lwd=3, col='red')   # maximum likelihood estimate: 1/5 = 0.2

[Figure: the binomial likelihood for 1 death in 5 cases as a function of p, with a red line at p = 0.2]

Likelihood function for the binomial distribution

We have more information with 50 cases and 10 deaths. Also notice that the likelihood function becomes narrower and more symmetric as we obtain more data.

p <- seq(0, 1, 0.01)
f <- dbinom(10, 50, p)            # likelihood of p for 10 deaths in 50 cases
plot(p, f, type='h', xlab='p', ylab='likelihood')
abline(v=0.2, lwd=3, col='red')   # maximum likelihood estimate: 10/50 = 0.2

[Figure: the binomial likelihood for 10 deaths in 50 cases as a function of p, with a red line at p = 0.2]

Estimating uncertainty

Our sense that our estimate of the case fatality rate (20%) is more precise when we have more data (10 deaths out of 50 cases compared with 1 death out of 5 cases) can be made mathematically rigorous.

The fact that MLEs are asymptotically normal can be used to estimate standard errors and approximate confidence intervals.

Constructing these intervals mathematically is beyond the scope of this class, but they are readily obtained from several software packages.
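
For the curious, here is a sketch of one common recipe (not necessarily what any particular package does): optim can return a numerically differentiated Hessian of the negative log-likelihood, and its inverse approximates the variance of the estimate. The example is the binomial case fatality rate with 10 deaths in 50 cases.

# Sketch: approximate standard errors from the Hessian of the negative
# log-likelihood (binomial case fatality example, 10 deaths in 50 cases)
nll <- function(p) -dbinom(10, size = 50, prob = p, log = TRUE)
fit <- optim(0.5, nll, method = "Brent", lower = 0.001, upper = 0.999,
             hessian = TRUE)
se <- sqrt(diag(solve(fit$hessian)))   # approximate standard error of p-hat
c(estimate = fit$par, se = se)
fit$par + c(-1.96, 1.96) * se          # approximate 95% Wald interval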

Likelihood profile

Another kind of confidence interval is based on the likelihood profile: for a single parameter, two values are statistically distinguishable (at the 95% level) if their log-likelihoods differ by more than 1.92, which is half the critical value of 3.84 from the \( \chi^2_1 \) distribution.

This is better illustrated with a continuous example, so we go back to these data:

plot(x0,y0, type='l', xlab='x', ylab='f(x)', main='Model and data', ylim=c(-5000,30000))
points(data)
lines(c(x0[10],x0[10]), c(y1[10], y0[10]))

[Figure ('Model and data'): the noisy observations with the true curve, as above]

Likelihood profile

We have the following SSE

[Figure: SSE plotted against the parameter a]

and likelihood (notice the y-axis limits)

[Figure: the likelihood plotted against the parameter a; note the y-axis limits]

Likelihood profile

Here we look at just a portion of the likelihood function near the maximum. The horizontal lines show the maximum and a value 1.92 log-likelihood units below the maximum. The heavy vertical line shows the maximum likelihood estimate and the lighter vertical lines show the confidence interval.

[Figure: the log-likelihood near its maximum, with the MLE and the confidence limits marked by vertical lines]
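
The same calculation can be sketched in code (reusing the simulated data frame from earlier; the model is redefined as model() because the name f was reassigned in the binomial plots above):

# Sketch: profile log-likelihood for a over a grid of values
model <- function(a, x) x^a
a.grid <- seq(2.15, 2.25, by = 0.0005)

profile.ll <- sapply(a.grid, function(a) {
  mu <- model(a, data$x0)
  sigma.hat <- sqrt(mean((data$y1 - mu)^2))   # MLE of sigma given a (profiles sigma out)
  sum(dnorm(data$y1, mean = mu, sd = sigma.hat, log = TRUE))
})

# Approximate 95% confidence interval: values of a within 1.92
# (= qchisq(0.95, 1)/2) log-likelihood units of the maximum
range(a.grid[profile.ll > max(profile.ll) - 1.92])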

The bbmle package

The bbmle package contains many auxiliary functions that are useful when fitting models with maximum likelihood (a short sketch of the workflow follows the list below)

  • mle2 for fitting models
  • profile for computing likelihood profiles
  • confint for obtaining confidence intervals
  • stdEr for obtaining standard errors
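
A minimal sketch of the workflow, assuming bbmle is installed, using the binomial case fatality example (10 deaths in 50 cases):

# Sketch: fit the binomial case-fatality example with bbmle
library(bbmle)

nll <- function(p) -dbinom(10, size = 50, prob = p, log = TRUE)

fit <- mle2(nll, start = list(p = 0.5), method = "L-BFGS-B",
            lower = 0.001, upper = 0.999)
coef(fit)      # maximum likelihood estimate of p
stdEr(fit)     # approximate standard error
confint(fit)   # profile-likelihood confidence interval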

Constructing the observables

The state variables of a model may not correspond to the observables. For instance, reported epidemic curves are not the same as the state variable \( I \):

  • Observed case counts are typically aggregated over a reporting interval (day, week, month)
  • Observed case counts are based on diagnosis that may occur at different points in disease progression
  • There may be lags between diagnostic date and reporting date

These idiosyncrasies can be accommodated by developing an observation process model that runs on top of the transmission model

  • e.g. total cases over time, based on diagnosis at the end of infectiousness, would accumulate as \( dC/dt=\gamma I \), and case reports in the interval \( (t, t+\delta t) \) would be \( C_{t+\delta t} - C_t \). Alternatively, cases might be associated with the date of symptom onset. (See the sketch below.)
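
As a sketch of such an observation model (all parameter values are made up; assumes the deSolve package), the accumulator variable C tracks cumulative cases, and differencing it gives per-interval case reports:

# Sketch: an SIR model with a cumulative-case accumulator C; weekly case
# reports are obtained by differencing C
library(deSolve)

sir <- function(t, x, parms) {
  with(as.list(c(x, parms)), {
    dS <- -beta * S * I
    dI <-  beta * S * I - gamma * I
    dC <-  gamma * I              # cases counted at the end of infectiousness
    list(c(dS, dI, dC))
  })
}

times <- seq(0, 20, by = 1)       # one reporting interval per unit of time
out <- ode(y = c(S = 0.999, I = 0.001, C = 0),
           times = times, func = sir,
           parms = c(beta = 2, gamma = 0.5))

reports <- diff(out[, "C"])       # case reports per reporting interval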

Likelihood Ratio Test

Likelihood ratio test for nested models

The likelihood ratio test statistic is given by

\( \lambda_{LR} = -2[\ell(\theta_0) - \ell(\theta_1)] \)

where \( \theta_0 \) is a null hypothesis and \( \theta_1 \) is an alternative hypothesis.

Wilks' theorem establishes that if the null hypothesis is true, \( \lambda_{LR} \) asymptotically follows a \( \chi^2 \) distribution with degrees of freedom equal to the number of parameters constrained under the null.

This can be used to perform custom hypothesis tests when we encode null and alternative hypotheses in the \( \theta_0 \) and \( \theta_1 \) vectors.
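
A sketch with made-up log-likelihood values, where the alternative model has one extra free parameter:

# Sketch: likelihood ratio test with hypothetical log-likelihoods
ll0 <- -432.6                      # log-likelihood under the null (made up)
ll1 <- -429.1                      # log-likelihood under the alternative (made up)
lambda <- -2 * (ll0 - ll1)         # test statistic
pchisq(lambda, df = 1, lower.tail = FALSE)   # p-value; df = number of constrained parameters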

Akaike's Information Criterion

Akaike's Information Criterion (AIC) can be used for non-nested models.

The AIC of a model is given by

\( AIC = 2k - 2\ln L(\hat\theta) \), where \( k \) is the number of estimated parameters and \( \hat\theta \) is the maximum likelihood estimate.

By a common rule of thumb, a model whose AIC exceeds the minimum AIC among the candidate models by more than 2 has substantially less support.
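
For example (a sketch using the binomial case fatality example, which has \( k = 1 \) estimated parameter):

# Sketch: AIC by hand for the binomial example (p-hat = 10/50 = 0.2)
logL <- dbinom(10, size = 50, prob = 0.2, log = TRUE)   # maximized log-likelihood
k <- 1
aic <- 2 * k - 2 * logL
aic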

Acknowledgements

Presentations and exercises draw significantly from materials developed with Pej Rohani, Ben Bolker, Matt Ferrari, Aaron King, and Dave Smith used during the 2009-2011 Ecology and Evolution of Infectious Diseases workshops and the 2009-2019 Summer Institutes in Statistics and Modeling of Infectious Diseases.

Licensed under the Creative Commons Attribution-NonCommercial license, http://creativecommons.org/licenses/by-nc/3.0/. Please share and remix noncommercially, mentioning its origin.