Here we give a high-level overview of the workflow required for defining and fitting a model with epidemia. The primary model fitting function is epim()
. This takes a model description and additional arguments relating to the fitting algorithm, and proceeds to fit the model using a precompiled Stan program. This is similar to the workflow for fitting Bayesian regression models
with rstanarm. A key difference, however, is that the models fit by epidemia are generally complex, and are therefore inherently more difficult to specify. We simplify this process by taking a modular approach; models are defined through three distinct parts: transmission, infections and observations. These components of the model are defined with the functions epirt()
, epiinf()
and
epiobs()
respectively.
The package contains an example data set EuropeCovid
which contains data
on daily death counts from Covid-19 in 11 European Countries from February through May 2020, and a set
of binary indicators of non-pharmaceutical interventions. This is
used as an example throughout.
We begin by describing epim()
in more detail, and then proceed to discuss the three modeling functions.
epim()
is the only model fitting function in epidemia. It has arguments rt
, inf
, and obs
which
expect a description of the transmission model, infection model and all observational models respectively.
Together, these fully define the joint distribution of data and parameters. Each of these model components are
described in terms of variables that are expected to live in a single data frame, data
. This data frame
must be compatible with the model components, in the sense that it holds all variables defined in these
models. For our example, these variables are the following.
data <- EuropeCovid$data colnames(data)
## [1] "country" "date"
## [3] "schools_universities" "self_isolating_if_ill"
## [5] "public_events" "lockdown"
## [7] "social_distancing_encouraged" "deaths"
## [9] "pop"
The data
argument is described in more detail in Section 1.5.
In addition to taking a model description and a data frame,
epim()
has various additional arguments which specify how the model should be
fit. If algorithm = "sampling"
then the model
will be fit using Stan’s adaptive Hamiltonian Monte Carlo sampler (Hoffman and Gelman 2014). This
is done internally by calling sampling()
from rstan. If instead this is "meanfield"
or "fullrank"
, then Stan’s Variational Bayes
algorithms (Kucukelbir et al. 2015, 2017) are employed by calling vb()
from rstan. Any unnamed
arguments in the call to epim()
are passed directly onto the rstan
sampling function. epim()
returns a fitted model object of class
epimodel
, which contains posterior samples from the model along with other
useful objects.
In general, Hamiltonian Monte Carlo should be used for final inference. Nonetheless, this is often computationally demanding, and Variational Bayes can often be
used fruitful for quickly iterating models. All arguments for epim()
are described in Table
1.1.
Argument | Description |
---|---|
rt
|
An object of class epirt , resulting from a call to epirt() . This defines the model for time-varying reproduction numbers \(R_t\).
|
inf
|
An object of class epiinf , resulting from a call to epiinf() . This entirely defines the model for infections \(i_t\).
|
obs
|
Either an object of class epiobs , or a list of such objects. Each of these define a model for an observation vector in data , and result from a call to epiobs() . Each element of the list defines a model for an observed variable.
|
data
|
A dataframe with all data required for fitting the model. This includes all observations and covariates specified in the model. |
algorithm
|
One of "sampling" , "meanfield" or "fullrank" . This determines the rstan sampling function to use for fitting the model. "sampling" corresponds to HMC, while "meanfield" and "fullrank" are Variational Bayes algorithms.
|
group_subset
|
If specified, a character vector naming a subset of groups/populations to include in the model. |
prior_PD
|
If TRUE , parameters are sampled from their prior distributions. This is useful for prior predictive checks. Defaults to FALSE .
|
… |
Additional arguments to pass to the rstan function used to fit the model. If algorithm = "sampling" , then this function is sampling() . Otherwise vb() is used.
|
epirt()
defines the model for time-varying reproduction numbers, which
was described in Section 1.4 of the model description article. Recall that these are modeled as a transformed linear predictor. epirt()
has a formula
argument which defines the linear predictor \(\eta\), an argument link
defining the link function g
, and additional arguments to specify priors on parameters making up \(\eta\).
A general R formula gives a symbolic description of a model. It takes the form
y ~ model
, where y
is the response and model
is a collection of terms
separated by the +
operator. model
fully defines a linear predictor used to
predict y
. In this case, the “response” being modeled are reproduction numbers which are
unobserved. epirt()
therefore requires that
the left hand side of the formula takes the form R(group, date)
, where group
and date
refer to variables representing the modeled populations and dates respectively.
The right hand side can consist of fixed effects, random effects, and
autocorrelation terms. For our example, a viable call to epirt()
is the
following.
rt <- epirt(formula = R(country, date) ~ 1 + lockdown + public_events, link = scaled_logit(7))
Here, two fixed effects are included which represent the
effects of implementing lockdown and banning public events. These effects are assumed constant across
countries. They could alternatively be partially
pooled by using the term (lockdown + public_events | country)
.
For information on how to interpret such terms, please see partial pooling. Using link = scaled_logit(7)
lets the link function be the
scaled logit link described by Equation (1.7) in the model description article,
where \(K = 7\) is the maximum possible value for reproduction numbers.
For simplicity, we have omitted any prior arguments, however these
should generally be specified explicitly. Please see
prior for detailed information on how to use priors. All arguments
for epirt()
are listed in Table 1.1.
Argument | Description |
---|---|
formula
|
An object of class formula which determines the linear predictor \(\eta\) for \(R\). The left hand side must take the form R(group, date) , where group must be a factor vector indicating group membership (i.e. country, state, age cohort), and date must be a vector of class Date . This is syntactic sugar for the reproduction number in the given group at the give date.
|
link
|
The link function \(g\). Can be "log" , "identity" or a call to scaled_logit() . Defaults to "log" .
|
center
|
If TRUE , covariates specified in formula are centered to have mean zero. All priors should then be interpreted as priors on the centered covariates.
|
prior
|
Same as in stan_glm() from rstanarm. Defines the prior on fixed effects \(\beta\). Priors provided by rstanarm can be used, and additionally shifted_gamma . Note: if autoscale = TRUE in the call to the prior function, then automatic rescaling takes place.
|
prior_intercept
|
Same as in stan_glm() from rstanarm. Prior for the regression intercept \(\beta_0\) (if it exists).
|
prior\_covariance
|
Same as in stan_glmer() from rstanarm. Defines the prior on the covariance matrix \(\Sigma\). Only use if the formula has one or more terms of the form (x | y) , in which case there are parameters to partially pool, i.e. \(b\) has positive length.
|
...
|
Additional arguments to pass to model.frame() from stats.
|
The infection model is represented by epiinf()
. In the most basic version, this defines the distribution of the generation time of the disease, the number of days for which to seed infections, and the prior distribution on seeded infections. These three parameters are
controlled by the arguments gen
, seed_days
and prior_seeds
respectively.
A possible model is the following.
inf <- epiinf(gen = EuropeCovid$si, seed_days = 6L, prior_seeds = hexp(exponential(0.02)))
EuropeCovid$si
is a numeric vector representing the distribution for the serial interval of Covid-19. There is an implicit assumption that the generation time can be approximated well by the serial interval. Seeds are modeled hierarchically, and are described by (1.4) and (1.5) in the model description. \(\tau\) has been assigned an exponential prior with a mean of 50. Seeded infections are assumed to occur over a period of 6 days.
epiinf()
has additional arguments that allow the user to extend the basic
model. Using latent = TRUE
replaces the renewal equation with
Equation (1.8). Daily infections are then treated as latent parameters that are
sampled along with other parameters. The family
argument specifies
the distribution \(p(i'_t, d)\), while prior_aux
defines the prior on the
coefficient of dispersion \(d\).
Recall from the model description that the infection process may be
modified to explicitly account for changes in infection rates as the remaining
susceptible population is depleted. In particular, the adjustment ensures that cumulative infections
never breaches the population size. It can be employed by
setting pop_adjust = TRUE
and using the
pop
argument to point towards a static variable in the data frame giving the population size. All argument to epiinf()
are
described in Table 1.3.
Argument | Description |
---|---|
gen
|
A numeric vector giving the probability mass function \(g_k\) for the generation time of the disease (must be a probability vector). |
seed_days
|
An integer giving the number of days \(v + 1\) for which to seed infections. Defaults to 6L. |
prior_seeds
|
Prior distribution on the seed parameter \(i\). Defaults to hexp(prior_aux = rstanarm::exponential(0.03)) .
|
latent
|
If TRUE , treat infections as latent parameters.
|
family
|
Specifies the family for the infection distribution \(p(i'_t, d)\). Only used if latent = TRUE , and defaults to "normal" .
|
prior_aux
|
Prior on the auxiliary variable \(d\) of \(p(i'_t,d)\). This is either the variance-to-mean ratio or the coefficient of variation, depending on the value of fixed_vtm . Only used if latent = TRUE .
|
fixed_vtm
|
If TRUE , then \(p(i'_t, d)\) has a fixed variance-to-mean ratio, i.e. variance is \(\sigma^2 = d i'_t\); In this case, \(d\) refers to the variance-to-mean ratio. Id FALSE then instead standard deviation is assumed proportional to the mean, in which case \(d\) is the coefficient of variation. Only used if latent = TRUE .
|
pop_adjust
|
If TRUE , applies the population adjustment to the infection process.
|
pops
|
A character vector giving the population variable. Only used if pop_adjust = TRUE .
|
prior_susc
|
Prior on \(S_{v-1} / P\), the initial susceptible population as a proportion of the population size. If NULL , this is assumed to be equal to 1 (i.e. everyone is initially susceptible). Otherwise, can be a call to normal() from rstanarm, which assigns a normal prior truncated to \([0,1]\). Only used if pop_adjust = TRUE .
|
rm
|
A character vector giving the variable corresponding to \(v_t\), i.e. the proportion of \(S_t\) to remove at time \(t\). Only used if pop_adjust = TRUE .
|
prior_rm_noise
|
Prior on the parameter \(\xi\), which controls noise around \(v_t\). If NULL , no noise is added. Only used if pop_adjust = TRUE .
|
An observational model is defined by a call to epiobs()
. In particular,
this must also make explicit the model for the multipliers \(\alpha_t\), and must
also specify the coefficients \(\pi_k\). epiobs()
has a formula
argument. The left hand
side must indicate the observation vector to be modeled, while the
right hand side defines a linear predictor for \(\alpha_t\). The argument i2o
plays a similar role to the gen
argument in epiinf()
,
however it instead corresponds the vector \(\pi\) in Equation (1.2).
Take for example the task of modeling daily deaths
, which as we saw is
a variable in data
. A possible model is the following.
deaths <- epiobs(formula = deaths ~ 1, i2o = EuropeCovid$inf2death, link = scaled_logit(0.02))
Here \(\alpha_t\) corresponds to the infection fatality rate (IFR), and is modeled as an
intercept transformed by the scaled-logit link. This implies that the IFR is constant over time and its value lies somewhere between 0% and 2%. If the prior on
the intercept (specified by the prior_intercept
argument) is chosen to be
symmetric around zero, then the prior mean for the IFR is 1%. EuropeCovid$inf2death
is a numeric simplex vector that gives the same delay distribution as used in
Flaxman et al. (2020). This is a density function for a discretized mixture of Gamma
random variables.
Additional arguments include family
, which specifies the sampling distribution
\(p(y_t, \phi)\). There are also arguments allowing the user to control
prior distributions for effects in the linear predictor for \(\alpha_t\), and the prior on the
auxiliary variable \(\phi\). All arguments to epiobs()
are shown in Table
1.4.
Argument | Description |
---|---|
formula
|
An object of class formula which determines the linear predictor for the ascertainment rate. The left hand side must define the response that is being modeled (i.e. the actual observations, not the latent ascertainments)
|
i2o
|
A numeric (probability) vector defining the probability mass function \(\pi_k\) of the time from an infection to an observation. |
family
|
A string representing the family of the sampling distribution \(p(y_t,\phi)\). Can be one of "poisson" , "neg_binom" , "quasi_poisson" , "normal" or "log_normal" .
|
link
|
A string representing the link function used to transform the linear predictor. Can be one of "logit" , "probit" , "cauchit" , "cloglog" , "identity" . Defaults to "logit" .
|
center ,
prior ,
prior_intercept
|
same as in epirt() , described above.
|
prior_aux
|
The prior distribution for the auxiliary parameter \(\phi\), if it exists. Only used if family is "neg_binom" (reciprocal dispersion), "quasi_poisson" (dispersion), "normal" (standard deviation) or "log_normal" (sigma parameter).
|
...
|
Additional arguments for model.frame() from stats.
|
Before fitting our first model in Section 1.6, we elaborate on
the data
argument to epim()
. Recall that this must contain all variables used in the transmission and infection models, and in all observational models. For our example,
data
looks like
head(data)
## # A tibble: 6 x 9
## # Groups: country [1]
## country date schools_universit… self_isolating_i… public_events lockdown
## <fct> <date> <int> <int> <int> <int>
## 1 Austria 2020-02-22 0 0 0 0
## 2 Austria 2020-02-23 0 0 0 0
## 3 Austria 2020-02-24 0 0 0 0
## 4 Austria 2020-02-25 0 0 0 0
## 5 Austria 2020-02-26 0 0 0 0
## 6 Austria 2020-02-27 0 0 0 0
## # … with 3 more variables: social_distancing_encouraged <int>, deaths <int>,
## # pop <int>
The columns country
and date
define the region and time period corresponding to each of the remaining
variables. epim()
assumes that the first seeding day (i.e. the start of the epidemic) in each region is the first date found in the data frame. The last data found for each region is the final data at which the epidemic is simulated. It is up to the user to appropriately choose these dates. For our example,
the first and last dates for each group can be seen as follows.
## # A tibble: 6 x 3
## country start end
## <fct> <date> <date>
## 1 Austria 2020-02-22 2020-05-05
## 2 Belgium 2020-02-18 2020-05-05
## 3 Denmark 2020-02-21 2020-05-05
## 4 France 2020-02-07 2020-05-05
## 5 Germany 2020-02-15 2020-05-05
## 6 Italy 2020-01-27 2020-05-05
Here, the start dates have been heuristically chosen to be 30 days prior to observing 10 cumulative deaths in each country.
We are now ready to fit our first model. For this we return to the model
fitting function epim()
. The following command is used to instruct
epidemia to run Markov chains in parallel, rather than sequentially, if
multiple cores are detected.
options(mc.cores = parallel::detectCores())
Our call to epim()
is as follows. We use refresh = 0
to suppress printing output in this article, however, this should
not generally be used as such output is useful.
fm <- epim(rt = rt, inf = inf, obs = deaths, data = data, group_subset = "France", algorithm = "sampling", iter = 1e3, seed = 12345, refresh = 0)
The print method for epimodel
objects prints summary statistics for
model parameters. These are obtained from the sampled posterior distribution.
Parameter are displayed
according to which part of the model they belong to (transmission, observations,
infections). An estimate of the standard deviation, labeled MAD_SD
is
displayed. This is the median absolute deviation from the median, and is more
robust than naive estimates of the standard deviation for long-tailed
distributions.
print(fm)
##
## Rt regression parameters:
## ==========
## coefficients:
## Median MAD_SD
## R|(Intercept) 0.7 0.2
## R|lockdown -2.4 0.3
## R|public_events -0.4 0.3
##
## deaths regression parameters:
## ==========
## coefficients:
## Median MAD_SD
## deaths|(Intercept) 0.0 0.2
## deaths|reciprocal dispersion 10.4 0.4
##
## Infection model parameters:
## ==========
## Median MAD_SD
## seeds[France] 15.1 5.2
## seeds_aux 27.2 22.0
Alternatively, the summary method can be used. This gives quantiles of the posterior draws, and also displays some MCMC diagnostics.
summary(fm)
##
##
## Estimates:
## mean sd 10% 50% 90%
## R|(Intercept) 0.7 0.2 0.5 0.7 0.9
## R|lockdown -2.4 0.3 -2.8 -2.4 -2.1
## R|public_events -0.4 0.3 -0.8 -0.4 0.0
## deaths|(Intercept) 0.0 0.2 -0.3 0.0 0.2
## seeds[France] 16.0 5.8 9.4 15.1 24.1
## seeds_aux 38.2 35.3 8.5 27.2 80.7
## deaths|reciprocal dispersion 10.5 0.5 10.1 10.4 11.1
##
## MCMC diagnostics
## mcse Rhat n_eff
## R|(Intercept) 0.0 1.0 1221
## R|lockdown 0.0 1.0 1168
## R|public_events 0.0 1.0 967
## deaths|(Intercept) 0.0 1.0 1649
## seeds[France] 0.2 1.0 1362
## seeds_aux 1.1 1.0 1113
## deaths|reciprocal dispersion 0.0 1.0 2210
## log-posterior 0.1 1.0 708
Flaxman, Seth, Swapnil Mishra, Axel Gandy, H Juliette T Unwin, Thomas A Mellan, Helen Coupland, Charles Whittaker, et al. 2020. “Estimating the effects of non-pharmaceutical interventions on COVID-19 in Europe.” Nature. https://doi.org/10.1038/s41586-020-2405-7.
Hoffman, Matthew D., and Andrew Gelman. 2014. “The no-U-turn sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo.” Journal of Machine Learning Research 15.
Kucukelbir, Alp, David M. Blei, Andrew Gelman, Rajesh Ranganath, and Dustin Tran. 2017. “Automatic Differentiation Variational Inference.” Journal of Machine Learning Research 18.
Kucukelbir, Alp, Rajesh Ranganath, Andrew Gelman, and David M. Blei. 2015. “Automatic variational inference in Stan.” In Advances in Neural Information Processing Systems. Vols. 2015-January.