John M. Drake
Model calibration requires tuning the parameters to “best” represent reality
Evaluation of this representation is by some measure of goodness of fit to the data
Model calibration is performed by adjusting the parameters to minimize some objective function (aka loss function)
For instance, least squares estimation minimizes the sum of squared errors
\[ SSE = \sum_{i=1}^n (y_i - f(x_i))^2 \]
where \( y_i \) is the \( i^{th} \) observation, \( x_i \) are the inputs to the model, and \( f() \) is the model.
The “adjustment” of parameters is typically performed by numerical optimization
# The Model
f <- function(a,x) x^a
x0 <- seq(0,100)
a0 <- 2.2
y0 <- f(a0, x0)
plot(x0,y0, type='l', xlab='x', ylab='f(x)', main='The truth', ylim=c(-5000,30000))
With some data
y1 <- y0 + rnorm(101, sd=3000)
data <- data.frame(x0,y1)
plot(x0,y0, type='l', xlab='x', ylab='f(x)', main='Model and data', ylim=c(-5000,30000))
points(data)
lines(c(x0[10],x0[10]), c(y1[10], y0[10]))
The objective function
obj <- function(parms0, data){
  y <- f(parms0, data$x0)        # model predictions at the observed x values
  sse <- sum((data$y1 - y)^2)    # sum of squared errors
  return(sse)
}
guess1 <- 1.8
guess2 <- 2.1
guess3 <- 2.4
sse1 <- obj(guess1,data)
sse2 <- obj(guess2,data)
sse3 <- obj(guess3,data)
Optimization is easy for a single parameter: the objective function can simply be evaluated over a grid of candidate values and inspected directly (see the sketch below).
With more parameters we rely on numerical optimization, for example with optim as below.
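For instance (a sketch using the f, obj, and data objects defined above; a.grid and sse.grid are illustrative names), the single-parameter objective can be evaluated over a grid of candidate exponents and the minimum read off directly:

# Evaluate the SSE objective over a grid of candidate exponents
a.grid <- seq(1.6, 2.5, by = 0.01)
sse.grid <- sapply(a.grid, obj, data = data)

# Plot the objective and mark the grid minimum
plot(a.grid, sse.grid, type = 'l', xlab = 'a', ylab = 'SSE')
abline(v = a.grid[which.min(sse.grid)], col = 'red')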
p0 <- 1.8
fit <- optim(p0, obj, data=data, method='Brent', lower=1.6, upper=2.5)
print(fit)
$par
[1] 2.193806
$value
[1] 854838252
$counts
function gradient
NA NA
$convergence
[1] 0
$message
NULL
What's so special about the sum of squared errors?
Likelihood theory substitutes an alternative objective function (the likelihood function), which is closely related to the probability density (or mass) function of the data
Suppose we have a continuous probability density \( f(x|\theta) \), for instance the pdf of the Gaussian distribution:
\[ f(x|\mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}} \]
From the laws of probability, for \( n \) independent observations \( x_1, x_2, x_3, ... x_n \), the joint density is
\[ f(x_1, x_2, x_3, ... x_n | \mu, \sigma) = \prod_{i=1}^n f(x_i|\mu, \sigma) \]
The idea behind likelihood is to interpret this formula as a function of the parameters given the data, written
\[ \mathcal{L} (\mu, \sigma ; x_1, x_2, x_3, ... x_n) = \prod_{i=1}^n f(x_i|\mu, \sigma) \]
Note: Likelihoods are not probabilities. They don't integrate to one and they don't specify the relative probability of different parameter values.
It is often convenient to transform likelihoods using the natural logarithm, in which case we work in the space of log-likelihood
\[ \ell (\theta ; \mathbf{y} ) = \ln \mathcal{L}(\theta; \mathbf{y}) \]
The idea of maximum likelihood is to find those values of the parameters that maximize the likelihood. Since the logarithm is a monotonic transform, the parameters that maximize the log-likelihood are the same as the parameters that maximize the likelihood.
Finally, many optimization algorithms are set up to minimize a function (as in minimizing the sum of squared errors); therefore, we often multiply by negative one and talk about minimizing the negative log-likelihood.
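As a minimal sketch of this workflow (the simulated sample x, the function nll, and the starting values below are illustrative, not part of the original example), the negative log-likelihood for an i.i.d. Gaussian sample can be written with dnorm(..., log = TRUE) and minimized with optim:

# Simulate an illustrative sample
set.seed(1)
x <- rnorm(100, mean = 5, sd = 2)

# Negative log-likelihood of (mu, log(sigma)); the log scale keeps sigma positive
nll <- function(par, x) {
  mu <- par[1]
  sigma <- exp(par[2])
  -sum(dnorm(x, mean = mu, sd = sigma, log = TRUE))
}

# Minimize the negative log-likelihood and back-transform the scale parameter
fit.nll <- optim(c(0, 0), nll, x = x)
c(mu.hat = fit.nll$par[1], sigma.hat = exp(fit.nll$par[2]))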
As sample size increases to infinity, sequences of maximum likelihood estimators are consistent (they converge to the true parameter value), asymptotically normal, and asymptotically efficient (they attain the lowest possible variance).
Note: There exists a large literature on statistical regularization that trades off bias for accuracy, which is particularly useful for estimation with small sample sizes. Such estimation is not yet widely used with epidemic models.
Steve Ellner: When it can do the job (which is most of the time), maximum likelihood is rarely the best possible tool, but it's rarely much worse than the best.
Transformations can be used to naturally bound parameters and provide smoother gradients for optimization
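For example (a hedged sketch with made-up binomial data; logit, expit, and nll.binom are illustrative names), a probability parameter can be estimated on the logit scale so that the optimizer searches over unconstrained values:

# Logit and inverse-logit for parameters bounded between 0 and 1
logit <- function(p) log(p / (1 - p))
expit <- function(z) 1 / (1 + exp(-z))

# Negative log-likelihood for a binomial proportion, parameterized on the logit scale
nll.binom <- function(logit.p, k, n) {
  p <- expit(logit.p)                 # back-transform to (0, 1)
  -dbinom(k, n, p, log = TRUE)
}

# Unconstrained optimization; back-transform the estimate afterwards
fit.p <- optim(0, nll.binom, k = 10, n = 50, method = 'BFGS')
expit(fit.p$par)                      # the estimate lies safely inside (0, 1)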
A special case: When the errors are normally distributed with constant variance then the least squares estimate and the maximum likelihood estimate are equivalent.
\[ f(x|\mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}} \]
However, this does not generalize.
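To illustrate the special case (a sketch reusing the simulated data frame from above; nll.power and fit.ml are illustrative names), we can fit the power model by minimizing a Gaussian negative log-likelihood and compare the estimate of a with the least squares estimate (about 2.19) obtained earlier:

# Gaussian negative log-likelihood for the power model y = x^a + error;
# the model x^a is written out directly and sigma is estimated on the log scale
nll.power <- function(par, data) {
  a <- par[1]
  sigma <- exp(par[2])
  -sum(dnorm(data$y1, mean = data$x0^a, sd = sigma, log = TRUE))
}

# The maximum likelihood estimate of a should agree closely with the
# least squares estimate found by optim above
fit.ml <- optim(c(1.8, log(3000)), nll.power, data = data)
fit.ml$par[1]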
Imagine we have a binomial sample (e.g. \( X \) deaths in \( n \) cases). We may assume the cases are identical and the chance of death in each case is identical with case fatality rate \( p \). Then, the probability of getting exactly \( X=k \) deaths is given by the probability mass function of the binomial distribution
\[ f(k,n,p) = Pr(X=k) = {n \choose k} p^k(1-p)^{n-k} \]
where
\[ {n \choose k } = \frac{n!}{k!(n-k)!} \]
is known as the binomial coefficient.
The function \( f(k,n,p) \) is available in R as dbinom. Hence, for 5 cases and 1 death we have the following likelihood.
p <- seq(0,1,0.01)
f <- dbinom(1,5,p)
plot(p,f, type='h', xlab='p', ylab='likelihood')
abline(v=0.2, lwd=3, col='red')
We have more information with 50 cases and 10 deaths. Also notice that the likelihood function is more symmetric as we obtain more data.
p <- seq(0,1,0.01)
f <- dbinom(10,50,p)
plot(p,f, type='h', xlab='p', ylab='likelihood')
abline(v=0.2, lwd=3, col='red')
Our sense that our estimate of the case fatality rate (20%) is more precise when we have more data (10 deaths out of 50 cases compared with 1 death out of 5 cases) can be made mathematically rigorous.
The fact that MLEs are asymptotically normal can be used to estimate standard errors and approximate confidence intervals.
Constructing these intervals mathematically is beyond the scope of this class, but they are readily obtained from several software packages.
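For instance (a sketch reusing the nll.power function and the data frame from above; fit.hess is an illustrative name), optim can return the Hessian of the negative log-likelihood at the optimum, and inverting this observed information matrix gives approximate standard errors:

# Refit the power model, asking optim to return the Hessian at the optimum
fit.hess <- optim(c(1.8, log(3000)), nll.power, data = data, hessian = TRUE)

# Approximate standard errors from the inverse of the observed information matrix
se <- sqrt(diag(solve(fit.hess$hessian)))

# Approximate 95% confidence interval for the exponent a
fit.hess$par[1] + c(-1.96, 1.96) * se[1]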
Another kind of confidence interval is based on the likelihood profile, where (for a single parameter) two values of the parameter are considered statistically different if their log-likelihoods differ by more than 1.96 (a cutoff based on the \( \chi^2 \) distribution).
This is better illustrated with a continuous example, so we go back to these data:
plot(x0,y0, type='l', xlab='x', ylab='f(x)', main='Model and data', ylim=c(-5000,30000))
points(data)
lines(c(x0[10],x0[10]), c(y1[10], y0[10]))
We have the following SSE curve and the corresponding log-likelihood curve (notice the y-axis limits).
Here we look at just a portion of the likelihood function near the maximum. The horizontal lines show the maximum and a value 1.96 log-likelihoods less than the maximum. The heavy vertical line shows the maximum likelihood estimate and the lighter vertical lines show the confidence interval.
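A sketch of this calculation (reusing the nll.power function and data frame from above; a.vals and prof are illustrative names): profile the log-likelihood over a by maximizing over log(sigma) at each candidate value, then keep the values of a within 1.96 log-likelihood units of the maximum:

# Profile log-likelihood for the exponent a: for each candidate value of a,
# maximize the likelihood over log(sigma)
a.vals <- seq(2.1, 2.3, by = 0.001)
prof <- sapply(a.vals, function(a)
  -optim(8, function(ls) nll.power(c(a, ls), data),
         method = 'Brent', lower = 0, upper = 20)$value)

# Maximum and the cutoff 1.96 log-likelihood units below it
plot(a.vals, prof, type = 'l', xlab = 'a', ylab = 'profile log-likelihood')
abline(h = max(prof) - c(0, 1.96), lty = c(1, 2))

# Approximate confidence interval for a
range(a.vals[prof > max(prof) - 1.96])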
bbmle: a package by Ben Bolker (“bb”) with tools for maximum likelihood estimation (“mle”)
mle2: uses the optim optimizer to find the maximum likelihood estimates when the user specifies an R function that evaluates the negative log-likelihood
profile: will efficiently calculate the likelihood profile, identifying the confidence intervals
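A hedged sketch of how these tools might be used (assuming the bbmle package is installed and reusing the simulated power-law data frame from above; nll.bb and fit.bb are illustrative names):

library(bbmle)

# Negative log-likelihood with named arguments, as mle2 expects
nll.bb <- function(a, log.sigma) {
  -sum(dnorm(data$y1, mean = data$x0^a, sd = exp(log.sigma), log = TRUE))
}

# Fit by maximum likelihood from rough starting values
fit.bb <- mle2(nll.bb, start = list(a = 1.8, log.sigma = log(3000)),
               method = 'Nelder-Mead')

# Profile likelihood confidence intervals for both parameters
confint(fit.bb)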
The state variables of a model may not correspond to the observables. For instance, reported epidemic curves are not the same as the state variable \( I \). These idiosyncrasies can be accommodated by developing an observation process model that runs on top of the transmission model.
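For instance (a purely illustrative sketch with made-up numbers), weekly case reports might be modeled as a binomial sample of the transmission model's true incidence, and the likelihood is then evaluated against the reports rather than against the state variable itself:

# Hypothetical true weekly incidence from a transmission model (made-up values)
incidence <- c(2, 5, 12, 30, 55, 70, 60, 40, 20, 8)

# Observation process: each infection is independently reported with probability 0.4
reporting.prob <- 0.4
set.seed(2)
reports <- rbinom(length(incidence), size = incidence, prob = reporting.prob)

# Log-likelihood of the observed reports given the model's incidence
sum(dbinom(reports, size = incidence, prob = reporting.prob, log = TRUE))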
Likelihood ratio test for nested models
The likelihood ratio test statistic is given by
\[ \lambda_{LR} = -2 \left[ \ell(\theta_0) - \ell(\theta_1) \right] \]
where \( \theta_0 \) is a null hypothesis and \( \theta_1 \) is an alternative hypothesis.
Wilks' theorem establishes that, if the null hypothesis is true, \( \lambda_{LR} \) asymptotically follows a \( \chi^2 \) distribution with degrees of freedom equal to the difference in the number of free parameters between the two models. This can be used to perform custom hypothesis tests when we encode the null and alternative hypotheses in the \( \theta_0 \) and \( \theta_1 \) vectors.
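As a sketch (reusing the data frame, nll.power, and fit.ml objects from above, and treating a fixed exponent a = 2 as a hypothetical null), the test statistic and its p-value can be computed directly:

# Null model: exponent fixed at a = 2; only log(sigma) is estimated
nll.null <- function(log.sigma, data) nll.power(c(2, log.sigma), data)
fit0 <- optim(8, nll.null, data = data, method = 'Brent', lower = 0, upper = 20)

# Test statistic: -2 [ l(theta0) - l(theta1) ], computed from the minimized
# negative log-likelihoods of the null and alternative fits
lambda <- 2 * (fit0$value - fit.ml$value)

# p-value from the chi-squared distribution with 1 degree of freedom
# (the models differ by one free parameter)
pchisq(lambda, df = 1, lower.tail = FALSE)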
Akaike's Information Criterion (AIC) can be used for non-nested models.
The AIC of a model is given by
\[ AIC = 2k - 2\ln \mathcal{L}(\hat{\theta}) \]
where \( k \) is the number of estimated parameters and \( \mathcal{L}(\hat{\theta}) \) is the maximized likelihood of the model.
AIC differences greater than 2 are generally considered meaningful, with the lower-AIC model preferred.
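For instance (a sketch reusing fit.ml and fit0 from the likelihood ratio example above), AIC can be computed directly from the minimized negative log-likelihoods:

# AIC = 2k + 2 * (minimized negative log-likelihood); smaller is better
aic.full <- 2 * 2 + 2 * fit.ml$value   # two free parameters: a and log(sigma)
aic.null <- 2 * 1 + 2 * fit0$value     # one free parameter: log(sigma), with a fixed at 2
c(full = aic.full, null = aic.null)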
Presentations and exercises draw significantly from materials developed with Pej Rohani, Ben Bolker, Matt Ferrari, Aaron King, and Dave Smith used during the 2009-2011 Ecology and Evolution of Infectious Diseases workshops and the 2009-2019 Summer Institutes in Statistics and Modeling of Infectious Diseases.
Licensed under the Creative Commons Attribution-NonCommercial license, http://creativecommons.org/licenses/by-nc/3.0/. Please share and remix noncommercially, mentioning its origin.