Infectious disease forecasting

John M. Drake & Pej Rohani

What is a forecast?

A forecast is a quantitative statement about an event, outcome, or trend that has not yet been observed, conditional on prior data that has been observed.

Adapted from: Lauer, S.A. et al. 2020. Infectious disease forecasting for public health. In Population Biology of Vector-borne Diseases edited by J.M. Drake, M.B. Bonsall, and M.R. Strand. Oxford University Press.

Typically, it is assumed that forecasts are of future events, but this definition allows for “retrospective” forecasts, including nowcasts.

Probabilistic forecasts are an important special case.

Models

Both mechanistic and statistical models may be used for forecasting.

Statistical models typically outperform mechanistic models.

Discussion question: What do you expect to be the strengths and weaknesses of these two kinds of models?

Models used for forecasting infectious diseases

Mechanistic

pro: can predict the effect of intervention
con: underperform compared with statistical models

Statistical

pro: high performing, generalizable
con: assume stationary data-generating process

A 2014 scoping review of influenza forecasting found that 17/35 studies used compartmental models and 18/35 studies used statistical models.

Chretien J-P, George D, Shaman J, Chitale RA, McKenzie FE 2014. Influenza Forecasting in Human Populations: A Scoping Review. PLoS ONE 9(4):e94130. doi:10.1371/journal.pone.0094130

Challenges for infectious disease forecasting

System complexity
Data sparsity
Behavior change
Forecasting feedback loop

Forecasting targets

Targets are the unknown (but verifiable) quantities that forecasts make quantitative statements about (i.e. an event, outcome, or trend).

Examples include:

Disease incidence, hospitalizations, or deaths at specific (future) times
Peak disease incidence, hospitalizations, or deaths
The time of peak disease incidence, hospitalizations or deaths
A binary indicator for whether incidence, hospitalizations, or deaths exceed a specified threshold
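The targets listed above can each be computed from a single incidence trajectory. The following is a minimal sketch, assuming a short made-up weekly case series; the function name `forecast_targets` and the threshold value are illustrative and not from the source.

```python
import numpy as np

def forecast_targets(incidence, threshold):
    """Compute example forecasting targets from one incidence trajectory."""
    incidence = np.asarray(incidence, dtype=float)
    peak_index = int(np.argmax(incidence))  # index of the first maximum
    return {
        "peak_value": float(incidence[peak_index]),
        "peak_time": peak_index + 1,  # times are numbered from 1, as in the notation
        "exceeds_threshold": bool(np.any(incidence > threshold)),
    }

# Hypothetical weekly case counts (illustrative data only).
weekly_cases = [3, 8, 21, 40, 35, 18, 7]
targets = forecast_targets(weekly_cases, threshold=30)
print(targets)
```

In a probabilistic forecast, the same function would be applied to each simulated or sampled trajectory, giving a distribution over each target.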

Notation

Time ranges from 1 to \( T \), typically in regular intervals (days, weeks, etc.), i.e. \( t \in \{1, 2, 3, \ldots, T\} \)
Observations (\( y \)) may be indexed by time, i.e., the time series \( \{y_1, y_2, y_3, \ldots, y_T\} \), assumed to be draws from random variables \( Y_1, Y_2, Y_3, \ldots, Y_T \)
Covariates may also be indexed by time, i.e. \( x_1, x_2, x_3, \ldots, x_T \), and may be scalar or vector-valued
Often we are interested in the predicted value of a target at a future relative time, indicated by a forecast horizon of \( k \) steps, i.e. the value of the target at time \( t+k \), or the \( k \)-step-ahead prediction
Targets may or may not be indexed by time and are represented by random variables \( Z \).

The goal

In probabilistic forecasting, we typically seek a predictive density function \( f(z_t|y_{1:t-k}, t, x_{1:t-k}) \) such that \( \int f(z) dz = 1 \).
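As a minimal illustration of a predictive density, assume (purely for the sake of example, not as a model from the source) that the observations follow a Gaussian random walk with step standard deviation `sigma`. Then the density of the target \( k \) steps after the last observation is normal with mean equal to the last observed value and variance \( k\sigma^2 \). The sketch below evaluates that density on a grid and checks numerically that it integrates to 1.

```python
import math
import numpy as np

def random_walk_predictive_density(z, y_history, k, sigma):
    """Density of the target k steps after the last observation,
    under an assumed Gaussian random walk (illustrative model only)."""
    mean = y_history[-1]            # random walk: forecast mean is the last value
    sd = sigma * math.sqrt(k)       # variance grows linearly with the horizon
    return np.exp(-0.5 * ((z - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

y_history = np.array([10.0, 12.0, 11.0, 13.0])   # made-up observations
grid = np.linspace(-20.0, 50.0, 7001)
density = random_walk_predictive_density(grid, y_history, k=2, sigma=1.5)

# Riemann-sum check that the density integrates to (approximately) 1.
step = grid[1] - grid[0]
total_mass = float(np.sum(density) * step)
print(total_mass)
```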

An empirical Bayes model (1)

Data: Percent weighted influenza-like illness (wILI) from the US Outpatient Influenza-like Illness Surveillance Network (ILINet)

Prediction targets: Epidemic onset, Peak height, Peak week, Duration (how long wILI remains above the 2013-2014 national baseline of 2%)

Bayes' Theorem:

\[ P(A|B) = \frac{P(B|A)P(A)}{P(B)} \]

In Bayesian Statistics, \( P(A) \) is known as the prior.
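A small numerical sketch of Bayes' theorem may help. Here \( A \) is taken to be "the season is severe" and \( B \) "early wILI is above baseline"; these events and all probabilities are made up for illustration. The denominator \( P(B) \) is obtained from the law of total probability.

```python
def posterior(prior_a, likelihood_b_given_a, likelihood_b_given_not_a):
    """Return P(A|B) from the prior P(A) and the two conditional probabilities of B."""
    # Law of total probability: P(B) = P(B|A)P(A) + P(B|not A)P(not A)
    evidence_b = (likelihood_b_given_a * prior_a
                  + likelihood_b_given_not_a * (1.0 - prior_a))
    return likelihood_b_given_a * prior_a / evidence_b

# Hypothetical numbers: prior 0.2, P(B|A) = 0.9, P(B|not A) = 0.3.
p_severe_given_high = posterior(0.2, 0.9, 0.3)
print(p_severe_given_high)
```

With these made-up inputs, observing \( B \) raises the probability of \( A \) from the prior of 0.2 to 0.18/0.42, or about 0.43.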

An empirical Bayes model (2)

Five procedures

Model past seasons' epidemic curves as smoothed versions plus noise (“quadratic trend filtering”).
Construct prior probability for the current season's epidemic curve by considering “transformations” of past seasons' curves.
Estimate what the wILI values in the recent past will be after their final revisions, using non-final wILI and Google Flu Trends data.
Weight possibilities for the current season's epidemic curve using estimates of final revised wILI.
Calculate forecasting targets for each possibility; report results.
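The weighting and target-calculation steps can be sketched in a heavily simplified form. The code below is not the method of Brooks et al.: it assumes a handful of made-up candidate curves with equal prior weight, independent Gaussian observation noise with a known standard deviation, and no curve transformations or data revisions. It only illustrates how candidate curves are weighted by their agreement with the data so far, and how a target distribution follows from those weights.

```python
import numpy as np

def weight_candidate_curves(candidates, observed, noise_sd):
    """Posterior weights for candidate season curves given observations so far.
    Assumes a uniform prior over candidates and Gaussian noise (illustrative)."""
    candidates = np.asarray(candidates, dtype=float)
    observed = np.asarray(observed, dtype=float)
    n_seen = len(observed)
    residuals = candidates[:, :n_seen] - observed
    log_lik = -0.5 * np.sum((residuals / noise_sd) ** 2, axis=1)
    log_lik -= log_lik.max()          # stabilize before exponentiating
    weights = np.exp(log_lik)
    return weights / weights.sum()

# Hypothetical candidate curves (rows) over six weeks, and three observed weeks.
candidates = np.array([
    [1.0, 2.0, 4.0, 6.0, 4.0, 2.0],
    [1.0, 1.5, 2.5, 4.0, 5.5, 3.0],
    [1.0, 3.0, 5.0, 3.0, 2.0, 1.0],
])
observed = [1.1, 1.6, 2.4]
weights = weight_candidate_curves(candidates, observed, noise_sd=0.5)

# Target: peak week. Sum the weights of the candidates sharing each peak week.
peak_weeks = candidates.argmax(axis=1) + 1
peak_week_distribution = {
    int(w): float(weights[peak_weeks == w].sum()) for w in np.unique(peak_weeks)
}
print(peak_week_distribution)
```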

An empirical Bayes model (3)

Retrospective empirical Bayes forecasts for the 2013–2014 national season, made with the final revisions of wILI values available up to different epidemiological weeks.

Brooks LC, Farrow DC, Hyun S, Tibshirani RJ, Rosenfeld R (2015) Flexible modeling of epidemics with an empirical Bayes framework. PLoS Computational Biology 11(8): e1004382.

A semi-parametric model

Data

Center for Systems Science and Engineering (Johns Hopkins University) COVID-19 cases and deaths
healthdata.gov COVID-19 Reported Patient Impact and Hospital Capacity by State
Time spent in residential areas according to Google community mobility reports
Vaccine uptake (December 2020 to May 2021)

Prediction targets:

1-28 day ahead hospital admissions
1-4 week ahead cases and deaths

A semi-parametric model

Time series of forecasting targets and range of revisions. Points are the California time series of COVID-19 indicators which our model is designed to forecast in the version of the data used to fit the model. The vertical line range gives the range of values that each observation had in all versions of the data used to fit our model.

A semi-parametric model

Indicators of the time spent at home. Grey points are the original data provided by Google of the amount of time people in California spent in residential areas relative to a pre-COVID-19 period. Black points are an example time series of the derived version which we use as a predictor of exposure in our model. Holiday effects in the original data are retained but weekly periodicity is removed.

A semi-parametric model

COVID-19 vaccination doses administered in California over time.

A semi-parametric model

Flow diagram of differential equations for expected values of state variables.

Local linearization (van Kampen expansion) supplies a differential equation for the covariance in fluctuations
Estimation using extended Kalman filter
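To show the mechanics of an extended Kalman filter, here is a scalar, discrete-time toy example. It is not the model of O'Dea and Drake: the logistic-growth transition, the direct noisy observation of the state, and all parameter values are assumptions chosen for illustration. The key idea is the same, though: the nonlinear dynamics are linearized around the current estimate (the Jacobian `jac`) to propagate the variance.

```python
def ekf_step(x, p, y, r=0.3, capacity=100.0, q=1.0, obs_var=4.0):
    """One predict/update cycle of a scalar extended Kalman filter.
    x, p: current state mean and variance; y: new observation.
    Transition (assumed): x' = x + r*x*(1 - x/capacity) + process noise (variance q).
    Observation (assumed): y = x' + noise (variance obs_var)."""
    # Predict: push the mean through the nonlinear map, the variance through its slope.
    x_pred = x + r * x * (1.0 - x / capacity)
    jac = 1.0 + r * (1.0 - 2.0 * x / capacity)   # derivative of the transition at x
    p_pred = jac * p * jac + q
    # Update: standard Kalman correction with a linear observation model.
    gain = p_pred / (p_pred + obs_var)
    x_new = x_pred + gain * (y - x_pred)
    p_new = (1.0 - gain) * p_pred
    return x_new, p_new

x, p = 5.0, 10.0                       # hypothetical initial mean and variance
for y in [7.0, 8.5, 11.0, 14.0]:       # made-up observations
    x, p = ekf_step(x, p, y)
print(x, p)
```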

A semi-parametric model

Effective reproduction number of the fitted model for California.

A semi-parametric model

Overall performance of short-term forecasts of cases and deaths. Lower scores indicate better performance. The GISST forecast consistently outperforms the COVID-19 Forecast Hub Baseline forecast and performs best for four-week-ahead death forecasts.

A semi-parametric model

Overall performance of short-term forecasts of hospital admissions. Lower scores indicate better performance. The GISST forecasts score better than the COVIDhub Ensemble forecasts at all horizons.

Ensembles

Ensemble models (commonly used in weather forecasting) combine the forecasts of multiple models into a single forecast, e.g. by averaging the individual model predictions with weights based on past performance.
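A performance-weighted average can be sketched as follows. The weighting scheme here (weights proportional to the exponential of each model's past mean log score) is one assumption among many possible choices, and the predictions and scores are made up.

```python
import numpy as np

def performance_weighted_ensemble(predictions, past_scores):
    """Combine point predictions using weights derived from past mean log scores
    (higher score = better). The softmax weighting is an illustrative choice."""
    predictions = np.asarray(predictions, dtype=float)
    past_scores = np.asarray(past_scores, dtype=float)
    weights = np.exp(past_scores - past_scores.max())   # subtract max for stability
    weights /= weights.sum()
    return float(np.dot(weights, predictions)), weights

# Three hypothetical models predicting next week's case count.
ensemble_value, weights = performance_weighted_ensemble(
    predictions=[100.0, 120.0, 140.0],
    past_scores=[-1.0, -2.0, -3.0],
)
print(ensemble_value, weights)
```

When all models have the same past score, the weights are equal and the ensemble reduces to a simple average.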

Building a forecasting system

A forecasting system comprises an online algorithm to estimate \( f \), a target prediction function to transform \( f \) into the desired form for output (e.g. the probability that the target will fall within each of a set of ranges, referred to as bins), and an estimate of performance.
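One simple form of a target prediction function converts samples from the predictive distribution into bin probabilities. The sketch below assumes the forecast is available as a set of samples; the sample values and bin edges are made up.

```python
import numpy as np

def bin_probabilities(samples, bin_edges):
    """Fraction of predictive samples falling in each bin.
    Samples outside the outermost edges are not counted in any bin,
    so the returned probabilities can sum to less than 1."""
    samples = np.asarray(samples, dtype=float)
    counts, _ = np.histogram(samples, bins=bin_edges)
    return counts / len(samples)

# Hypothetical predictive samples of peak wILI (percent) and unit-width bins.
samples = [0.5, 1.2, 1.8, 2.5, 2.6, 3.9, 4.4, 1.1]
bin_edges = [0, 1, 2, 3, 4, 5]
probs = bin_probabilities(samples, bin_edges)
print(probs)
```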

George, D.B. et al. 2019. Technology to advance infectious disease forecasting for outbreak management. Nature Communications 10:3932 https://doi.org/10.1038/s41467-019-11901

Evaluating forecasts

Evaluation metrics should be defined and finite for all conceivable applications.

Forecasts should be evaluated using proper scoring rules.

Forecast accuracy should be evaluated using out-of-sample observations.

Two forecasting metrics for probabilistic forecasts

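Log score: \( LogS = \frac{1}{T} \sum_{t=1}^T \log \hat p (z_t) \), where \( \hat p (z_t) \) is the forecast probability (or density) assigned to the observed value of the target \( z_t \)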

Continuous ranked probability score: \( CRPS = \int \left( F(z) - \mathcal{H}(z-y) \right)^2 dz \), where \( F(z) \) is the forecasted cumulative distribution function, \( y \) is the observed value, and

\[ \mathcal{H}(z) = \begin{cases} 1, & \text{if } z \geq 0\\ 0, & \text{if } z < 0 \end{cases} \]

is known as the Heaviside function. The CRPS generalizes the concept of the absolute error of a point prediction to the case of the function-valued probabilistic forecast.
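The CRPS integral can be evaluated numerically for any forecast CDF. As a sketch, the code below assumes a normal forecast distribution (an illustrative choice), integrates the squared difference between the CDF and the step function at the observation on a grid, and compares the result with the known closed-form CRPS for a normal forecast. All numerical values are made up.

```python
import math
import numpy as np

def normal_cdf(z, mu, sigma):
    """CDF of a Normal(mu, sigma^2) forecast, vectorized over z."""
    erf = np.vectorize(math.erf)
    z = np.asarray(z, dtype=float)
    return 0.5 * (1.0 + erf((z - mu) / (sigma * math.sqrt(2.0))))

def crps_by_integration(mu, sigma, y, n_points=20001):
    """Riemann-sum approximation of the CRPS integral for a normal forecast."""
    lower = min(mu, y) - 10.0 * sigma
    upper = max(mu, y) + 10.0 * sigma
    grid = np.linspace(lower, upper, n_points)
    step_at_y = (grid >= y).astype(float)          # Heaviside function of (z - y)
    integrand = (normal_cdf(grid, mu, sigma) - step_at_y) ** 2
    return float(np.sum(integrand) * (grid[1] - grid[0]))

def crps_normal_closed_form(mu, sigma, y):
    """Closed-form CRPS of a normal forecast (standard result)."""
    w = (y - mu) / sigma
    pdf = math.exp(-0.5 * w * w) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(w / math.sqrt(2.0)))
    return sigma * (w * (2.0 * cdf - 1.0) + 2.0 * pdf - 1.0 / math.sqrt(math.pi))

numeric = crps_by_integration(mu=10.0, sigma=2.0, y=12.0)
exact = crps_normal_closed_form(mu=10.0, sigma=2.0, y=12.0)
print(numeric, exact)
```

The CRPS is in the units of the target, and smaller values indicate a better forecast.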

To evaluate performance over multiple observations, the scores may be averaged, e.g.

\( \frac{1}{T} \sum_{t=1}^T \log \hat p (z_t) \).
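The averaged log score can be computed directly once each forecast supplies a density. The sketch below assumes normal forecast densities and made-up observations; under the sign convention of the formula above, a higher (less negative) mean log score is better.

```python
import math

def normal_log_density(z, mu, sigma):
    """Log density of Normal(mu, sigma^2) evaluated at z."""
    return (-0.5 * math.log(2.0 * math.pi) - math.log(sigma)
            - 0.5 * ((z - mu) / sigma) ** 2)

def mean_log_score(observations, means, sds):
    """Average of log p_hat(z_t) over T forecast/observation pairs."""
    scores = [normal_log_density(z, m, s)
              for z, m, s in zip(observations, means, sds)]
    return sum(scores) / len(scores)

# Hypothetical observations and two sets of forecasts with the same means:
# one sharp (sd = 1) and one diffuse (sd = 5).
observations = [10.0, 12.0, 9.0]
means = [10.0, 11.0, 10.0]
sharp = mean_log_score(observations, means, [1.0, 1.0, 1.0])
diffuse = mean_log_score(observations, means, [5.0, 5.0, 5.0])
print(sharp, diffuse)
```

Here the sharp forecasts score higher because they are both concentrated and close to the observations; a sharp forecast that misses badly would be penalized heavily.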

Model training and testing

Data are split into model training and testing sequences
Tuning is typically performed via cross-validation (but this can be tricky with time series data, because random splits allow future observations to inform predictions of the past)
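One common way to respect time ordering is rolling-origin (forward-chaining) evaluation, in which each test point lies strictly after all of its training data. The sketch below is one possible implementation; the function name and arguments are illustrative, and indices are zero-based positions in the time series.

```python
def rolling_origin_splits(n_obs, initial_train, horizon):
    """Generate (train_indices, test_index) pairs for k-step-ahead evaluation.
    Each split trains on the first train_end observations and tests on the
    observation horizon steps after the last training point."""
    splits = []
    for train_end in range(initial_train, n_obs - horizon + 1):
        train_idx = list(range(train_end))
        test_idx = train_end + horizon - 1
        splits.append((train_idx, test_idx))
    return splits

# Eight observations, at least five for training, two-step-ahead target.
splits = rolling_origin_splits(n_obs=8, initial_train=5, horizon=2)
for train_idx, test_idx in splits:
    print(train_idx, "->", test_idx)
```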

Conclusion

Infectious disease forecasting is an active area of research
Models optimized for forecasting may be different to those optimized for inference
The structure of the forecasting problem creates specific problems for model construction, estimation, and evaluation
Semiparametric models are a promising approach to navigating the bias-variance tradeoff

For further reading

Brooks LC, Farrow DC, Hyun S, Tibshirani RJ, Rosenfeld R. 2015. Flexible modeling of epidemics with an empirical Bayes framework. PLoS Computational Biology 11:e1004382.

Gibson GC, Moran KR, Reich NG, Osthus D. 2021. Improving probabilistic infectious disease forecasting through coherence. PLoS Computational Biology 17:e1007623.

O'Dea EB, Drake JM. 2022. A semi-parametric, state-space compartmental model with time-dependent parameters for forecasting COVID-19 cases, hospitalizations and deaths. Journal of the Royal Society Interface 19:20210702.

License

Licensed under the Creative Commons Attribution-NonCommercial license, http://creativecommons.org/licenses/by-nc/3.0/. Please share and remix noncommercially, mentioning its origin.