1 Introduction

The open-source R (R Core Team 2020) package epidemia provides a framework for Bayesian, regression-oriented modeling of the temporal dynamics of infectious diseases. Typically, but not exclusively, these models are fit to areal time-series; i.e. aggregated event counts for a given population and period. Disease dynamics are described explicitly; observed data are linked to latent infections, which are in turn modeled as a self-exciting process tempered by time-varying reproduction numbers. Regression models are specified for several objects in the model. For example, reproduction numbers are expressed as a transformed predictor, which may include both covariates and autoregressive terms. A range of prior distributions can be specified for unknown parameters by leveraging the functionality of rstanarm (Goodrich et al. 2020). Multilevel models are supported by partially pooling covariate effects appearing in the predictor for reproduction numbers between multiple populations.

The mathematical framework motivating the implemented models has been described in Bhatt et al. (2020). Specific analyses using such models have appeared during the COVID-19 pandemic, and have been used to estimate the effect of control measures (Flaxman et al. 2020; Mellan et al. 2020; Olney et al. 2021), and to forecast disease dynamics under assumed epidemiological parameters and mitigation scenarios (Vollmer et al. 2020; Hawryluk et al. 2020). The modeling approach has been extended to estimate differences in transmissibility between COVID-19 lineages (Faria et al. 2021; Volz et al. 2021).

Models of infectious disease dynamics are commonly classified as either mechanistic or statistical (Myers et al. 2000). Mechanistic models derive infection dynamics from theoretical considerations over how diseases spread within and between communities. An example of this are deterministic compartmental models (DCMs) (Kermack, William Ogilvy and McKendrick 1927, 1932, 1933), which propose differential equations that govern the change in infections over time. These equations are motivated by contacts between individuals in susceptible and infected classes. Purely statistical models, on the other hand, make few assumptions over the transmission mechanism, and instead infer future dynamics from the history of the process and related covariates. Examples include Generalized Linear Models (GLMs), time series approaches including Auto Regressive Integrated Moving Average (ARIMA) (Box and Jenkins 1962), and more modern forecasting methods based on machine learning.

epidemia provides models which are semi-mechanistic. These are statistical models that explicitly describe infection dynamics. Self-exciting processes are used to propagate infections in discrete time. Previous infections directly precipitate new infections. Moreover, the memory kernel of the process allows an individual’s infectiousness to depend explicitly on the time since infection. This approach has been used in multiple previous works (Fraser 2007; Cori et al. 2013; Nouvellet et al. 2018; Cauchemez et al. 2008) and has been shown to correspond to a Susceptible-Exposed-Infected-Recovered (SEIR) model when a particular form for the generation distribution is used (Champredon, Dushoff, and Earn 2018). In addition, population adjustments may be applied to account for depletion of the susceptible population. The models are statistical in the sense that they define a likelihood function for the observed data. After also specifying prior distributions for model parameters, samples from the posterior can then be obtained using either Hamiltonian Monte Carlo or Variational Bayes methods.

The Bayesian approach has certain advantages in this context. Several aspects of these models are fundamentally unidentified (Roosa and Chowell 2019). For most diseases, infection counts are not fully observable and suffer from under-reporting (Gibbons et al. 2014). Recorded counts could be explained by a high infection and low ascertainment regime, or alternatively by low infections and high ascertainment. If a series of mitigation efforts are applied in sequence to control an epidemic, then the effects may be confounded and difficult to disentangle (Bhatt et al. 2020). Bayesian approaches using MCMC allow full exploration of posterior correlations between such coupled parameters. Informative, or weakly informative, priors may be incorporated to regularize, and help to mitigate identifiability problems, which may otherwise pose difficulties for sampling (Gelman et al. 2008; Gelman and Shalizi 2013).

epidemia’s functionality can be used for a number of purposes. A researcher can simulate infection dynamics under assumed parameters by setting tight priors around the assumed values. It is then possible to sample directly from the prior distribution without conditioning on data. This allows in-silico experimentation; for example, to assess the effect of varying a single parameter (reproduction numbers, seeded infections, incubation period). Another goal of modeling is to assess whether a simple and parsimonious model of reality can replicate observed phenomena. This helps to isolate processes helpful for explaining the data. Models of varying complexity can be specified within epidemia, largely as a result of its regression-oriented framework. Posterior predictive checks can be used to assess model fit. If the model is deemed misspecified, additional features may be considered. This could be modeling population adjustments, explicit modeling of super-spreader events (Wong and Collins 2020), alternative and over-dispersed models for the data, or more flexible functional forms for reproduction numbers or ascertainment rates. This can be done rapidly within epidemia’s framework.

Forecasting models are critical during an ongoing epidemic as they are used to inform policy decisions under uncertainty. As a sign of their importance, the United States Centers for Disease Control and Prevention (CDC) has run a series of forecasting challenges, including the FluSight seasonal forecasting challenges since 2015 (https://www.cdc.gov/flu/weekly/flusight/) and more recently the Covid-19 Forecast hub (https://covid19forecasthub.org/). Similar challenges have been run by the European Center for Disease Prevention and Control (ECDC) (https://covid19forecasthub.eu/). Long-term forecasts quantify the cost of an unmitigated epidemic, and provide a baseline from which to infer the effects of control measures. Short-term forecasts are crucial in informing decisions on how to distribute resources such as PPE or respirators, or whether hospitals should increase capacity and cancel less urgent procedures. Traditional statistical approaches often give unrealistic long-term forecasts as they do not explicitly account for population effects. The semi-mechanistic approach of epidemia combines the strengths of statistical approaches with plausible infection dynamics, and can thus be used for forecasting at different tenures.

1.1 Related packages

The Comprehensive R Archive Network (CRAN) (https://cran.r-project.org/) provides a rich ecosystem of R packages dedicated to epidemiological analysis. The R Epidemics Consortium website (https://www.repidemicsconsortium.org/) lists a number of these. Packages that model infectious disease dynamics vary significantly by the methods used to model transmission. RLadyBug (Höhle and Feldmann 2007) is a R package for parameter estimation and simulation for stochastic compartmental models, including SEIR-type models. Both likelihood-based and Bayesian inference are supported. amei (Merl et al. 2010) provides online inference for a stochastic SIR model with a negative-binomial transmission function, however the primary focus is on identifying optimal intervention strategies. See Andersson and Britton (2000) for an introduction to stochastic epidemic modeling.

epinet (Groendyke and Welch 2018) and epimodel (Jenness, Goodreau, and Morris 2018) provide functionality to simulate compartmental models over contact networks. epinet uses the class of dyadic-independent exponential random graph models (ERGMs) to model the network, and perform full Bayesian inference over model parameters. epimodel considers instead dynamic networks, inferring only network parameters and assuming epidemic parameters to be known.

Epidemic data often presents in the form of areal data, recording event counts over disjoint groups during discrete time intervals. This is the prototypical data type supported within epidemia. Areal data can be modeled using purely statistical methods. The glm() function in stats can be used to fit simple time-series models to count data. The package acp (Vasileios 2015) allows for fitting autoregressive Poisson regression (ACP) models to count data, with potentially additional covariates. tscount (Liboschik, Fokianos, and Fried 2017) expands on acp, in particularly providing more flexible link functions and over-dispersed distributions.

Like epidemia, the R package Surveillance (Meyer, Held, and Höhle 2017) implements regression-oriented modeling of epidemic dynamics. The package offers models for three different spatial and temporal resolutions of epidemic data. For areal data, which is the focus of epidemia, the authors implement a multivariate time-series approach (Held, Höhle, and Hofmann 2005; Paul, Held, and Toschke 2008; Paul and Held 2011; Held and Paul 2012). This model differs from the semi-mechanistic approach used here in several ways. First, the model has no mechanistic component: neither infections and transmission are explicitly described. The model is similar in form to a vector autoregressive model of order 1 (\(\text{VAR}(1)\)). The lag 1 assumption implies that each count series is Markovian. In epidemia, the infection process has an interpretation as an AR process with both order and coefficients determined by the generation distribution. This can therefore model more flexible temporal dependence in observed data.

EpiEstim (Cori et al. 2013; Cori 2020) infers time-varying reproduction numbers \(R_t\) using case counts over time and an approximation of the disease’s generation distribution. Infection incidence is assumed to follow a Poisson process with expectation given by a renewal equation. R0 (Obadia, Haneef, and Boëlle 2012) implements techniques for estimating both initial and time-varying transmission rates. In particular, the package implements the method of Wallinga and Teunis (2004), which bases estimates off a probabilistic reconstruction of transmission trees. epidemia differs from these packages in several ways. First, if infection counts are low then the Poisson assumption may be too restrictive, as super-spreader events can lead to over-dispersion in the infection process. Our framework permits over-dispersed distributions for modeling latent infections. Second, epidemia allows flexible prior models for \(R_t\), including the ability to use time-series methods. For example, \(R_t\) can be parameterized as a random walk. Finally, infections over time are often unobserved, and subject to under-reporting that is both space and time dependent. We account for this by providing flexible observation models motivated by survival processes. Several count data series may be used simultaneously within the model in order to leverage additional information on \(R_t\).

The probabilistic programming language Stan (Stan Development Team 2018) has been used extensively to specify and fit Bayesian models for disease transmission during the Covid-19 pandemic. Examples analyses include Flaxman et al. (2020), Hauser et al. (2020) and Doremalen et al. (2020). For tutorials on implementing such models, see for example Grinsztajn et al. (2021) or Chatzilena et al. (2019). epidemia uses the framework offered by Stan to both specify and fit models. User-specified models are internally translated into data that is passed to a precompiled Stan program. The models are fit using sampling methods from rstan (Stan Development Team 2020).

2 References

Andersson, Håkan, and Tom Britton. 2000. Stochastic Epidemic Models and Their Statistical Analysis. Vol. 151. New York, NY: Springer New York. https://doi.org/10.1007/978-1-4612-1158-7.

Bhatt, Samir, Neil Ferguson, Seth Flaxman, Axel Gandy, Swapnil Mishra, and James A Scott. 2020. “Semi-Mechanistic Bayesian Modeling of COVID-19 with Renewal Processes.” arXiv Preprint arXiv:2012.00394. https://arxiv.org/abs/2012.00394.

Box, G. E. P., and G. M. Jenkins. 1962. “Some Statistical Aspects of Adaptive Optimization and Control.” Journal of the Royal Statistical Society: Series B (Methodological) 24 (2): 297–331. https://doi.org/10.1111/j.2517-6161.1962.tb00460.x.

Cauchemez, Simon, Alain Jacques Valleron, Pierre Yves Boëlle, Antoine Flahault, and Neil M. Ferguson. 2008. “Estimating the impact of school closure on influenza transmission from Sentinel data.” Nature. https://doi.org/10.1038/nature06732.

Champredon, David, Jonathan Dushoff, and David J. D. Earn. 2018. “Equivalence of the Erlang-distributed SEIR epidemic model and the renewal equation.” SIAM Journal on Applied Mathematics. https://doi.org/10.1137/18M1186411.

Chatzilena, Anastasia, Edwin van Leeuwen, Oliver Ratmann, Marc Baguelin, and Nikolaos Demiris. 2019. “Contemporary statistical inference for infectious disease models using Stan.” Epidemics 29 (December). https://doi.org/10.1016/j.epidem.2019.100367.

Cori, Anne. 2020. EpiEstim: Estimate Time Varying Reproduction Numbers from Epidemic Curves. https://cran.r-project.org/package=EpiEstim.

Cori, Anne, Neil M. Ferguson, Christophe Fraser, and Simon Cauchemez. 2013. “A new framework and software to estimate time-varying reproduction numbers during epidemics.” American Journal of Epidemiology. https://doi.org/10.1093/aje/kwt133.

Doremalen, Neeltje van, Trenton Bushmaker, Dylan H. Morris, Myndi G. Holbrook, Amandine Gamble, Brandi N. Williamson, Azaibi Tamin, et al. 2020. “Aerosol and Surface Stability of SARS-CoV-2 as Compared with SARS-CoV-1.” New England Journal of Medicine 382 (16). https://doi.org/10.1056/NEJMc2004973.

Faria, Nuno R., Thomas A. Mellan, Charles Whittaker, Ingra M. Claro, Darlan da S. Candido, Swapnil Mishra, Myuki A. E. Crispim, et al. 2021. “Genomics and epidemiology of the P.1 SARS-CoV-2 lineage in Manaus, Brazil.” Science, April. https://doi.org/10.1126/science.abh2644.

Flaxman, Seth, Swapnil Mishra, Axel Gandy, H Juliette T Unwin, Thomas A Mellan, Helen Coupland, Charles Whittaker, et al. 2020. “Estimating the effects of non-pharmaceutical interventions on COVID-19 in Europe.” Nature. https://doi.org/10.1038/s41586-020-2405-7.

Fraser, Christophe. 2007. “Estimating individual and household reproduction numbers in an emerging epidemic.” PLoS ONE. https://doi.org/10.1371/journal.pone.0000758.

Gelman, Andrew, Aleks Jakulin, Maria Grazia Pittau, and Yu-Sung Su. 2008. “A weakly informative default prior distribution for logistic and other regression models.” The Annals of Applied Statistics 2 (4). https://doi.org/10.1214/08-AOAS191.

Gelman, Andrew, and Cosma Rohilla Shalizi. 2013. “Philosophy and the practice of Bayesian statistics.” British Journal of Mathematical and Statistical Psychology 66 (1). https://doi.org/10.1111/j.2044-8317.2011.02037.x.

Gibbons, Cheryl L, Marie-Josée J Mangen, Dietrich Plass, Arie H Havelaar, Russell John Brooke, Piotr Kramarz, Karen L Peterson, et al. 2014. “Measuring underreporting and under-ascertainment in infectious disease datasets: a comparison of methods.” BMC Public Health 14 (1). https://doi.org/10.1186/1471-2458-14-147.

Goodrich, Ben, Jonah Gabry, Imad Ali, and Sam Brilleman. 2020. “rstanarm: Bayesian applied regression modeling via Stan.” https://mc-stan.org/rstanarm/.

Grinsztajn, Léo, Elizaveta Semenova, Charles C Margossian, and Julien Riou. 2021. “Bayesian workflow for disease transmission modeling in Stan.” http://arxiv.org/abs/2006.02985.

Groendyke, Chris, and David Welch. 2018. “epinet : An R Package to Analyze Epidemics Spread across Contact Networks.” Journal of Statistical Software 83 (11). https://doi.org/10.18637/jss.v083.i11.

Hauser, Anthony, Michel J. Counotte, Charles C. Margossian, Garyfallos Konstantinoudis, Nicola Low, Christian L. Althaus, and Julien Riou. 2020. “Estimation of SARS-CoV-2 mortality during the early stages of an epidemic: A modeling study in Hubei, China, and six regions in Europe.” PLOS Medicine 17 (7). https://doi.org/10.1371/journal.pmed.1003189.

Hawryluk, Iwona, Thomas A. Mellan, Henrique Hoeltgebaum, Swapnil Mishra, Ricardo P. Schnekenberg, Charles Whittaker, Harrison Zhu, et al. 2020. “Inference of COVID-19 epidemiological distributions from Brazilian hospital data.” Journal of the Royal Society Interface 17 (172). https://doi.org/10.1098/rsif.2020.0596.

Held, Leonhard, Michael Höhle, and Mathias Hofmann. 2005. “A statistical framework for the analysis of multivariate infectious disease surveillance counts.” Statistical Modelling 5 (3). https://doi.org/10.1191/1471082X05st098oa.

Held, Leonhard, and Michaela Paul. 2012. “Modeling seasonality in space-time infectious disease surveillance data.” Biometrical Journal 54 (6). https://doi.org/10.1002/bimj.201200037.

Höhle, Michael, and Ulrike Feldmann. 2007. “RLadyBug—An R package for stochastic epidemic models.” Computational Statistics & Data Analysis 52 (2). https://doi.org/10.1016/j.csda.2006.11.016.

Jenness, Samuel M., Steven M. Goodreau, and Martina Morris. 2018. “EpiModel : An R Package for Mathematical Modeling of Infectious Disease over Networks.” Journal of Statistical Software 84 (8). https://doi.org/10.18637/jss.v084.i08.

Kermack, William Ogilvy and McKendrick, A. G. 1927. “A contribution to the mathematical theory of epidemics.” Proceedings of the Royal Society of London. Series A, Containing Papers of a Mathematical and Physical Character. https://doi.org/10.1098/rspa.1927.0118.

———. 1932. “Contributions to the mathematical theory of epidemics. II. —The problem of endemicity.” Proceedings of the Royal Society of London. Series A, Containing Papers of a Mathematical and Physical Character 138 (834). https://doi.org/10.1098/rspa.1932.0171.

———. 1933. “Contributions to the mathematical theory of epidemics. III.—Further studies of the problem of endemicity.” Proceedings of the Royal Society of London. Series A, Containing Papers of a Mathematical and Physical Character 141 (843). https://doi.org/10.1098/rspa.1933.0106.

Liboschik, Tobias, Konstantinos Fokianos, and Roland Fried. 2017. “tscount: An R Package for Analysis of Count Time Series Following Generalized Linear Models.” Journal of Statistical Software 82 (5): 1–51. https://doi.org/10.18637/jss.v082.i05.

Mellan, Thomas A., Henrique H. Hoeltgebaum, Swapnil Mishra, Charlie Whittaker, Ricardo P. Schnekenberg, Axel Gandy, H. Juliette T. Unwin, et al. 2020. “Subnational analysis of the COVID-19 epidemic in Brazil.” https://doi.org/10.1101/2020.05.09.20096701.

Merl, Daniel, Leah R. Johnson, Robert B. Gramacy, and Marc Mangel. 2010. “amei : An R Package for the Adaptive Management of Epidemiological Interventions.” Journal of Statistical Software 36 (6). https://doi.org/10.18637/jss.v036.i06.

Meyer, Sebastian, Leonhard Held, and Michael Höhle. 2017. “Spatio-Temporal Analysis of Epidemic Phenomena Using the R Package surveillance.” Journal of Statistical Software 77 (11). https://doi.org/10.18637/jss.v077.i11.

Myers, M. F., D. J. Rogers, J. Cox, A. Flahault, and S. I. Hay. 2000. “Forecasting disease risk for increased epidemic preparedness in public health.” Advances in Parasitology 47: 309–30. https://doi.org/10.1016/s0065-308x(00)47013-2.

Nouvellet, Pierre, Anne Cori, Tini Garske, Isobel M. Blake, Ilaria Dorigatti, Wes Hinsley, Thibaut Jombart, et al. 2018. “A simple approach to measure transmissibility and forecast incidence.” Epidemics. https://doi.org/10.1016/j.epidem.2017.02.012.

Obadia, Thomas, Romana Haneef, and Pierre-Yves Boëlle. 2012. “The R0 package: a toolbox to estimate reproduction numbers for epidemic outbreaks.” BMC Medical Informatics and Decision Making 12 (1). https://doi.org/10.1186/1472-6947-12-147.

Olney, Andrew M, Jesse Smith, Saunak Sen, Fridtjof Thomas, and H Juliette T Unwin. 2021. “Estimating the Effect of Social Distancing Interventions on COVID-19 in the United States.” American Journal of Epidemiology, January. https://doi.org/10.1093/aje/kwaa293.

Paul, M., and L. Held. 2011. “Predictive assessment of a non-linear random effects model for multivariate time series of infectious disease counts.” Statistics in Medicine 30 (10). https://doi.org/10.1002/sim.4177.

Paul, M., L. Held, and A. M. Toschke. 2008. “Multivariate modelling of infectious disease surveillance data.” Statistics in Medicine 27 (29). https://doi.org/10.1002/sim.3440.

R Core Team. 2020. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Roosa, Kimberlyn, and Gerardo Chowell. 2019. “Assessing parameter identifiability in compartmental dynamic models using a computational approach: application to infectious disease transmission models.” Theoretical Biology and Medical Modelling 16 (1). https://doi.org/10.1186/s12976-018-0097-6.

Stan Development Team. 2018. “The Stan Core Library.” http://mc-stan.org/.

———. 2020. “RStan: the R interface to Stan.” https://mc-stan.org/.

Vasileios, Siakoulis. 2015. acp: Autoregressive Conditional Poisson. https://cran.r-project.org/package=acp.

Vollmer, Michaela A. C., Swapnil Mishra, H. Juliette T Unwin, Axel Gandy, Thomas A. Mellan, Valerie Bradley, Harrison Zhu, et al. 2020. “Report 20: Using mobility to estimate the transmission intensity of COVID-19 in Italy: A subnational analysis with future scenarios.” https://doi.org/10.1101/2020.05.05.20089359.

Volz, Erik, Swapnil Mishra, Meera Chand, Jeffrey C. Barrett, Robert Johnson, Lily Geidelberg, Wes R. Hinsley, et al. 2021. “Assessing transmissibility of SARS-CoV-2 lineage B.1.1.7 in England.” Nature. https://doi.org/10.1038/s41586-021-03470-x.

Wallinga, Jacco, and Peter Teunis. 2004. “Different epidemic curves for severe acute respiratory syndrome reveal similar impacts of control measures.” American Journal of Epidemiology. https://doi.org/10.1093/aje/kwh255.

Wong, Felix, and James J. Collins. 2020. “Evidence that coronavirus superspreading is fat-tailed.” Proceedings of the National Academy of Sciences 117 (47). https://doi.org/10.1073/pnas.2018490117.