Here we give a highlevel overview of the workflow required for defining and fitting a model with epidemia. The primary model fitting function is epim()
. This takes a model description and additional arguments relating to the fitting algorithm, and proceeds to fit the model using a precompiled Stan program. This is similar to the workflow for fitting Bayesian regression models
with rstanarm. A key difference, however, is that the models fit by epidemia are generally complex, and are therefore inherently more difficult to specify. We simplify this process by taking a modular approach; models are defined through three distinct parts: transmission, infections and observations. These components of the model are defined with the functions epirt()
, epiinf()
and
epiobs()
respectively.
The package contains an example data set EuropeCovid
which contains data
on daily death counts from Covid19 in 11 European Countries from February through May 2020, and a set
of binary indicators of nonpharmaceutical interventions. This is
used as an example throughout.
We begin by describing epim()
in more detail, and then proceed to discuss the three modeling functions.
epim()
is the only model fitting function in epidemia. It has arguments rt
, inf
, and obs
which
expect a description of the transmission model, infection model and all observational models respectively.
Together, these fully define the joint distribution of data and parameters. Each of these model components are
described in terms of variables that are expected to live in a single data frame, data
. This data frame
must be compatible with the model components, in the sense that it holds all variables defined in these
models. For our example, these variables are the following.
data < EuropeCovid$data colnames(data)
## [1] "country" "date"
## [3] "schools_universities" "self_isolating_if_ill"
## [5] "public_events" "lockdown"
## [7] "social_distancing_encouraged" "deaths"
## [9] "pop"
The data
argument is described in more detail in Section 1.5.
In addition to taking a model description and a data frame,
epim()
has various additional arguments which specify how the model should be
fit. If algorithm = "sampling"
then the model
will be fit using Stan’s adaptive Hamiltonian Monte Carlo sampler (Hoffman and Gelman 2014). This
is done internally by calling sampling()
from rstan. If instead this is "meanfield"
or "fullrank"
, then Stan’s Variational Bayes
algorithms (Kucukelbir et al. 2015, 2017) are employed by calling vb()
from rstan. Any unnamed
arguments in the call to epim()
are passed directly onto the rstan
sampling function. epim()
returns a fitted model object of class
epimodel
, which contains posterior samples from the model along with other
useful objects.
In general, Hamiltonian Monte Carlo should be used for final inference. Nonetheless, this is often computationally demanding, and Variational Bayes can often be
used fruitful for quickly iterating models. All arguments for epim()
are described in Table
1.1.
Argument  Description 

rt

An object of class epirt , resulting from a call to epirt() . This defines the model for timevarying reproduction numbers \(R_t\).

inf

An object of class epiinf , resulting from a call to epiinf() . This entirely defines the model for infections \(i_t\).

obs

Either an object of class epiobs , or a list of such objects. Each of these define a model for an observation vector in data , and result from a call to epiobs() . Each element of the list defines a model for an observed variable.

data

A dataframe with all data required for fitting the model. This includes all observations and covariates specified in the model. 
algorithm

One of "sampling" , "meanfield" or "fullrank" . This determines the rstan sampling function to use for fitting the model. "sampling" corresponds to HMC, while "meanfield" and "fullrank" are Variational Bayes algorithms.

group_subset

If specified, a character vector naming a subset of groups/populations to include in the model. 
prior_PD

If TRUE , parameters are sampled from their prior distributions. This is useful for prior predictive checks. Defaults to FALSE .

… 
Additional arguments to pass to the rstan function used to fit the model. If algorithm = "sampling" , then this function is sampling() . Otherwise vb() is used.

epirt()
defines the model for timevarying reproduction numbers, which
was described in Section 1.4 of the model description article. Recall that these are modeled as a transformed linear predictor. epirt()
has a formula
argument which defines the linear predictor \(\eta\), an argument link
defining the link function g
, and additional arguments to specify priors on parameters making up \(\eta\).
A general R formula gives a symbolic description of a model. It takes the form
y ~ model
, where y
is the response and model
is a collection of terms
separated by the +
operator. model
fully defines a linear predictor used to
predict y
. In this case, the “response” being modeled are reproduction numbers which are
unobserved. epirt()
therefore requires that
the left hand side of the formula takes the form R(group, date)
, where group
and date
refer to variables representing the modeled populations and dates respectively.
The right hand side can consist of fixed effects, random effects, and
autocorrelation terms. For our example, a viable call to epirt()
is the
following.
rt < epirt(formula = R(country, date) ~ 1 + lockdown + public_events, link = scaled_logit(7))
Here, two fixed effects are included which represent the
effects of implementing lockdown and banning public events. These effects are assumed constant across
countries. They could alternatively be partially
pooled by using the term (lockdown + public_events  country)
.
For information on how to interpret such terms, please see partial pooling. Using link = scaled_logit(7)
lets the link function be the
scaled logit link described by Equation (1.7) in the model description article,
where \(K = 7\) is the maximum possible value for reproduction numbers.
For simplicity, we have omitted any prior arguments, however these
should generally be specified explicitly. Please see
prior for detailed information on how to use priors. All arguments
for epirt()
are listed in Table 1.1.
Argument  Description 

formula

An object of class formula which determines the linear predictor \(\eta\) for \(R\). The left hand side must take the form R(group, date) , where group must be a factor vector indicating group membership (i.e. country, state, age cohort), and date must be a vector of class Date . This is syntactic sugar for the reproduction number in the given group at the give date.

link

The link function \(g\). Can be "log" , "identity" or a call to scaled_logit() . Defaults to "log" .

center

If TRUE , covariates specified in formula are centered to have mean zero. All priors should then be interpreted as priors on the centered covariates.

prior

Same as in stan_glm() from rstanarm. Defines the prior on fixed effects \(\beta\). Priors provided by rstanarm can be used, and additionally shifted_gamma . Note: if autoscale = TRUE in the call to the prior function, then automatic rescaling takes place.

prior_intercept

Same as in stan_glm() from rstanarm. Prior for the regression intercept \(\beta_0\) (if it exists).

prior\_covariance

Same as in stan_glmer() from rstanarm. Defines the prior on the covariance matrix \(\Sigma\). Only use if the formula has one or more terms of the form (x  y) , in which case there are parameters to partially pool, i.e. \(b\) has positive length.

...

Additional arguments to pass to model.frame() from stats.

The infection model is represented by epiinf()
. In the most basic version, this defines the distribution of the generation time of the disease, the number of days for which to seed infections, and the prior distribution on seeded infections. These three parameters are
controlled by the arguments gen
, seed_days
and prior_seeds
respectively.
A possible model is the following.
inf < epiinf(gen = EuropeCovid$si, seed_days = 6L, prior_seeds = hexp(exponential(0.02)))
EuropeCovid$si
is a numeric vector representing the distribution for the serial interval of Covid19. There is an implicit assumption that the generation time can be approximated well by the serial interval. Seeds are modeled hierarchically, and are described by (1.4) and (1.5) in the model description. \(\tau\) has been assigned an exponential prior with a mean of 50. Seeded infections are assumed to occur over a period of 6 days.
epiinf()
has additional arguments that allow the user to extend the basic
model. Using latent = TRUE
replaces the renewal equation with
Equation (1.8). Daily infections are then treated as latent parameters that are
sampled along with other parameters. The family
argument specifies
the distribution \(p(i'_t, d)\), while prior_aux
defines the prior on the
coefficient of dispersion \(d\).
Recall from the model description that the infection process may be
modified to explicitly account for changes in infection rates as the remaining
susceptible population is depleted. In particular, the adjustment ensures that cumulative infections
never breaches the population size. It can be employed by
setting pop_adjust = TRUE
and using the
pop
argument to point towards a static variable in the data frame giving the population size. All argument to epiinf()
are
described in Table 1.3.
Argument  Description 

gen

A numeric vector giving the probability mass function \(g_k\) for the generation time of the disease (must be a probability vector). 
seed_days

An integer giving the number of days \(v + 1\) for which to seed infections. Defaults to 6L. 
prior_seeds

Prior distribution on the seed parameter \(i\). Defaults to hexp(prior_aux = rstanarm::exponential(0.03)) .

latent

If TRUE , treat infections as latent parameters.

family

Specifies the family for the infection distribution \(p(i'_t, d)\). Only used if latent = TRUE , and defaults to "normal" .

prior_aux

Prior on the auxiliary variable \(d\) of \(p(i'_t,d)\). This is either the variancetomean ratio or the coefficient of variation, depending on the value of fixed_vtm . Only used if latent = TRUE .

fixed_vtm

If TRUE , then \(p(i'_t, d)\) has a fixed variancetomean ratio, i.e. variance is \(\sigma^2 = d i'_t\); In this case, \(d\) refers to the variancetomean ratio. Id FALSE then instead standard deviation is assumed proportional to the mean, in which case \(d\) is the coefficient of variation. Only used if latent = TRUE .

pop_adjust

If TRUE , applies the population adjustment to the infection process.

pops

A character vector giving the population variable. Only used if pop_adjust = TRUE .

prior_susc

Prior on \(S_{v1} / P\), the initial susceptible population as a proportion of the population size. If NULL , this is assumed to be equal to 1 (i.e. everyone is initially susceptible). Otherwise, can be a call to normal() from rstanarm, which assigns a normal prior truncated to \([0,1]\). Only used if pop_adjust = TRUE .

rm

A character vector giving the variable corresponding to \(v_t\), i.e. the proportion of \(S_t\) to remove at time \(t\). Only used if pop_adjust = TRUE .

prior_rm_noise

Prior on the parameter \(\xi\), which controls noise around \(v_t\). If NULL , no noise is added. Only used if pop_adjust = TRUE .

An observational model is defined by a call to epiobs()
. In particular,
this must also make explicit the model for the multipliers \(\alpha_t\), and must
also specify the coefficients \(\pi_k\). epiobs()
has a formula
argument. The left hand
side must indicate the observation vector to be modeled, while the
right hand side defines a linear predictor for \(\alpha_t\). The argument i2o
plays a similar role to the gen
argument in epiinf()
,
however it instead corresponds the vector \(\pi\) in Equation (1.2).
Take for example the task of modeling daily deaths
, which as we saw is
a variable in data
. A possible model is the following.
deaths < epiobs(formula = deaths ~ 1, i2o = EuropeCovid$inf2death, link = scaled_logit(0.02))
Here \(\alpha_t\) corresponds to the infection fatality rate (IFR), and is modeled as an
intercept transformed by the scaledlogit link. This implies that the IFR is constant over time and its value lies somewhere between 0% and 2%. If the prior on
the intercept (specified by the prior_intercept
argument) is chosen to be
symmetric around zero, then the prior mean for the IFR is 1%. EuropeCovid$inf2death
is a numeric simplex vector that gives the same delay distribution as used in
Flaxman et al. (2020). This is a density function for a discretized mixture of Gamma
random variables.
Additional arguments include family
, which specifies the sampling distribution
\(p(y_t, \phi)\). There are also arguments allowing the user to control
prior distributions for effects in the linear predictor for \(\alpha_t\), and the prior on the
auxiliary variable \(\phi\). All arguments to epiobs()
are shown in Table
1.4.
Argument  Description 

formula

An object of class formula which determines the linear predictor for the ascertainment rate. The left hand side must define the response that is being modeled (i.e. the actual observations, not the latent ascertainments)

i2o

A numeric (probability) vector defining the probability mass function \(\pi_k\) of the time from an infection to an observation. 
family

A string representing the family of the sampling distribution \(p(y_t,\phi)\). Can be one of "poisson" , "neg_binom" , "quasi_poisson" , "normal" or "log_normal" .

link

A string representing the link function used to transform the linear predictor. Can be one of "logit" , "probit" , "cauchit" , "cloglog" , "identity" . Defaults to "logit" .

center ,
prior ,
prior_intercept

same as in epirt() , described above.

prior_aux

The prior distribution for the auxiliary parameter \(\phi\), if it exists. Only used if family is "neg_binom" (reciprocal dispersion), "quasi_poisson" (dispersion), "normal" (standard deviation) or "log_normal" (sigma parameter).

...

Additional arguments for model.frame() from stats.

Before fitting our first model in Section 1.6, we elaborate on
the data
argument to epim()
. Recall that this must contain all variables used in the transmission and infection models, and in all observational models. For our example,
data
looks like
head(data)
## # A tibble: 6 x 9
## # Groups: country [1]
## country date schools_universit… self_isolating_i… public_events lockdown
## <fct> <date> <int> <int> <int> <int>
## 1 Austria 20200222 0 0 0 0
## 2 Austria 20200223 0 0 0 0
## 3 Austria 20200224 0 0 0 0
## 4 Austria 20200225 0 0 0 0
## 5 Austria 20200226 0 0 0 0
## 6 Austria 20200227 0 0 0 0
## # … with 3 more variables: social_distancing_encouraged <int>, deaths <int>,
## # pop <int>
The columns country
and date
define the region and time period corresponding to each of the remaining
variables. epim()
assumes that the first seeding day (i.e. the start of the epidemic) in each region is the first date found in the data frame. The last data found for each region is the final data at which the epidemic is simulated. It is up to the user to appropriately choose these dates. For our example,
the first and last dates for each group can be seen as follows.
## # A tibble: 6 x 3
## country start end
## <fct> <date> <date>
## 1 Austria 20200222 20200505
## 2 Belgium 20200218 20200505
## 3 Denmark 20200221 20200505
## 4 France 20200207 20200505
## 5 Germany 20200215 20200505
## 6 Italy 20200127 20200505
Here, the start dates have been heuristically chosen to be 30 days prior to observing 10 cumulative deaths in each country.
We are now ready to fit our first model. For this we return to the model
fitting function epim()
. The following command is used to instruct
epidemia to run Markov chains in parallel, rather than sequentially, if
multiple cores are detected.
options(mc.cores = parallel::detectCores())
Our call to epim()
is as follows. We use refresh = 0
to suppress printing output in this article, however, this should
not generally be used as such output is useful.
fm < epim(rt = rt, inf = inf, obs = deaths, data = data, group_subset = "France", algorithm = "sampling", iter = 1e3, seed = 12345, refresh = 0)
The print method for epimodel
objects prints summary statistics for
model parameters. These are obtained from the sampled posterior distribution.
Parameter are displayed
according to which part of the model they belong to (transmission, observations,
infections). An estimate of the standard deviation, labeled MAD_SD
is
displayed. This is the median absolute deviation from the median, and is more
robust than naive estimates of the standard deviation for longtailed
distributions.
print(fm)
##
## Rt regression parameters:
## ==========
## coefficients:
## Median MAD_SD
## R(Intercept) 0.7 0.2
## Rlockdown 2.4 0.3
## Rpublic_events 0.4 0.3
##
## deaths regression parameters:
## ==========
## coefficients:
## Median MAD_SD
## deaths(Intercept) 0.0 0.2
## deathsreciprocal dispersion 10.4 0.4
##
## Infection model parameters:
## ==========
## Median MAD_SD
## seeds[France] 15.1 5.2
## seeds_aux 27.2 22.0
Alternatively, the summary method can be used. This gives quantiles of the posterior draws, and also displays some MCMC diagnostics.
summary(fm)
##
##
## Estimates:
## mean sd 10% 50% 90%
## R(Intercept) 0.7 0.2 0.5 0.7 0.9
## Rlockdown 2.4 0.3 2.8 2.4 2.1
## Rpublic_events 0.4 0.3 0.8 0.4 0.0
## deaths(Intercept) 0.0 0.2 0.3 0.0 0.2
## seeds[France] 16.0 5.8 9.4 15.1 24.1
## seeds_aux 38.2 35.3 8.5 27.2 80.7
## deathsreciprocal dispersion 10.5 0.5 10.1 10.4 11.1
##
## MCMC diagnostics
## mcse Rhat n_eff
## R(Intercept) 0.0 1.0 1221
## Rlockdown 0.0 1.0 1168
## Rpublic_events 0.0 1.0 967
## deaths(Intercept) 0.0 1.0 1649
## seeds[France] 0.2 1.0 1362
## seeds_aux 1.1 1.0 1113
## deathsreciprocal dispersion 0.0 1.0 2210
## logposterior 0.1 1.0 708
Flaxman, Seth, Swapnil Mishra, Axel Gandy, H Juliette T Unwin, Thomas A Mellan, Helen Coupland, Charles Whittaker, et al. 2020. “Estimating the effects of nonpharmaceutical interventions on COVID19 in Europe.” Nature. https://doi.org/10.1038/s4158602024057.
Hoffman, Matthew D., and Andrew Gelman. 2014. “The noUturn sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo.” Journal of Machine Learning Research 15.
Kucukelbir, Alp, David M. Blei, Andrew Gelman, Rajesh Ranganath, and Dustin Tran. 2017. “Automatic Differentiation Variational Inference.” Journal of Machine Learning Research 18.
Kucukelbir, Alp, Rajesh Ranganath, Andrew Gelman, and David M. Blei. 2015. “Automatic variational inference in Stan.” In Advances in Neural Information Processing Systems. Vols. 2015January.