Here we give a high-level overview of the workflow required for defining and fitting a model with
epidemia. The primary model fitting function is
epim(). This takes a model description and
additional arguments relating to the fitting algorithm, and proceeds to fit the model using a
precompiled Stan program. This allows model fitting to begin immediately as opposed to requiring compilation each time
epim() is called.
This is similar to the workflow for fitting Bayesian regression models
with rstanarm. A key difference, however, is that the models fit by epidemia are much more
complex, and are therefore inherently more difficult to specify. epidemia aims to simplify this process
by modularizing the model definition into three distinct parts: transmission,
infections and observations. These components of the model are defined with the functions
epidemia contains an example dataset
EuropeCovid which contains data
on daily death counts from Covid-19 in 11 European Countries from February through May 2020, and a set
of binary indicators of non-pharmaceutical interventions. This is
used as an example throughout.
We begin by describing
epim() in more detail, and then proceed to discuss the three modeling functions.
epim() is the only model fitting function in epidemia. It has arguments
expect a description of the transmission model, infection model and all observational models respectively.
Together, these fully define the joint distribution of data and parameters. Each of these model components are
described in terms of variables that are expected to live in a single dataframe,
data. This dataframe
must be compatible with the model components, in the sense that it holds all variables defined in these
models. For our example, these variables are the following.
data <- EuropeCovid$data colnames(data)
##  "country" "date" ##  "schools_universities" "self_isolating_if_ill" ##  "public_events" "lockdown" ##  "social_distancing_encouraged" "deaths" ##  "pop"
data argument is described in more detail in Section 6.
In addition to taking a model description and a dataframe,
epim() has various additional arguments which specify how the model should be
algorithm = "sampling" then the model
will be fit using Stan’s adaptive Hamiltonian Monte Carlo sampler. This
is done internally by calling
algorithm = "meanfield" or
algorithm = "fullrank", then Stan’s variational Bayes algorithms are used
instead, by calling
rstan::vb. Any unnamed
arguments in the call to
epim() are passed directly onto the rstan
epim() returns a fitted model object of class
"epimodel", which contains posterior samples from the model along with other
In general, the adaptive Hamiltonian Monte Carlo sampler
should be used for final inference. Nonetheless, fitting these models
using HMC is often computationally demanding, and variational Bayes can often be
fruitful for quickly iterating models. All arguments for
epim() are described in Table
An object of class
An object of class
Either an object of class
||A dataframe with all data required for fitting the model. This includes all observation variables and covariates specified in the models for the reproduction number and ascertainment rates.|
||If specified, a character vector naming a subset of regions to include in the model.|
Same as in
Additional arguments to pass to the
We now describe the three modeling functions in more detail, starting with
epirt() defines the model for time-varying reproduction numbers, which
was described in Section 4 of the model description article.
It has a
formula argument which defines the linear predictor \(\eta\), an
link defining the link function
g, and additional arguments to
specify priors on parameters making up \(\eta\).
A general R formula gives a symbolic description of a model. It takes the form
y ~ model, where
y is the response and
model is a collection of terms
separated by the
model fully defines a linear predictor used to
y. In this case, the “response” being modeled are reproduction numbers which are
epirt therefore requires that
the left hand side of the formula takes the form
R(group, date), where
date refer to variables representing the region and date respectively.
The right hand side can consist of fixed effects, random effects, and
autocorrelation terms. For our example, a viable call to
epirt() is the
In the above example, two fixed effects are included which represent the
effects of lockdown and public events. These effects are assumed constant across
countries. They could alternatively be partially
pooled by using the term
(lockdown + public_events | country).
For information on how to interpret such terms, please read the partial-pooling
link = scaled_logit(7) lets the link function be the
scaled logit link described in the model description article (Equation 4.2),
where \(K = 7\) is maximum possible value for reproduction numbers.
For simplicity, we have omitted any prior arguments, however these
should generally be specified explicitly. Please see
here for detailed information on how to use priors. All arguments
epirt() are listed in Table 3.1.
An object of class
The link function \(g\). Can be
Same as in
Same as in
Same as in
The model for infections is represented by
epiinf(). For the
basic version of the model, this defines the generation distribution
of the disease, the number of days for which to seed infections, and the
prior distribution on the parameter \(\tau\) (Equation 3.2). Recall that \(\tau\) is
the prior mean on daily seeded infections. These three parameters are
controlled by the arguments
For our example, a possible model is the following.
EuropeCovid$si is a numeric vector representing the probability mass
function for the serial interval of Covid-19. Here, we are implicitly assuming
that the generation distribution can be approximated well by the serial interval.
In this example, \(\tau\)
is given an exponential prior with a mean of 50. These seeds
occur over a period of 6 days.
epiinf() has additional arguments which allow the user to extend the basic
latent=TRUE replaces the renewal process (Equation 1.3) with
Equation 5.1. Daily infections are then treated as latent parameters that are
sampled along with other parameters. The
family argument then gives
the distribution for \(p(i'_t, d)\), while
prior_aux defines the prior on the
coefficient of dispersion \(d\).
Recall that one can adjust the infection process to explicitly
model changes in infection rates as the remaining susceptible population is
depleted. In particular, the adjustment ensures that cumulative infections
never breaches the initial susceptible population. The adjustment was described
in Section 5.3 of the model description article. It can be employed by
pop_adjust = TRUE and using the
susceptibles argument to point towards a variable in the dataframe which gives
the susceptible population at each point in time. All argument to
described in Table 4.1.
||A numeric vector giving the probability mass function \(g_k\) of the generation distribution. Must be a simplex vector, i.e. nonnegative and summing to 1.|
||An integer giving the number of seed days \(v + 1\). Defaults to 6L.|
Specifies the family for \(p(i'_t, d)\). Only used if
||Prior on the coefficient of variation \(d\).|
Prior distribution for the hyperparameter \(\tau\), which is the mean of the prior distribution for infection seeding. Defaults to
A character vector giving the name of the variable corresponding to the susceptible population over time. Only used if
Each observational model is given by a call to
epiobs(). In particular,
this must define the model for ascertainment rates \(\alpha\) and the time distribution
from infection to observation \(\pi\).
epiobs() has a formula argument. The left hand
side must define the observation vector \(Y\) which is to be modeled, while the
right hand side defines a linear predictor for the ascertainment rate \(\alpha\).
i2o plays a similar role to the
gen argument in
however it instead defines the vector \(\pi\) in Equation (1.2).
Suppose we wish to model daily
deaths, which as we saw is a variable in
One could use the following.
The above models the infection fatality rate, \(\alpha\), as an intercept transformed by the scaled logit
link. This implies that it is constant over time, can achieve a maximum value of
2%, and a minimum value of 0%. If
prior_intercept is chosen to be symmetric
around zero, then the prior mean of \(\alpha\) is 1%.
a numeric simplex vector that gives the same delay distribution as used in
Flaxman et al. (2020), which is a discretized p.d.f. of a mixture of Gamma random variables.
Additional arguments include
family, which specifies the sampling distribution
\(p(y_t, \phi)\). There are also arguments allowing the user to control
prior distributions for parameters describing \(\alpha\), and the prior on the
auxiliary variable \(\phi\). All arguments to
epiobs() are shown in Table
An object of class
||A numeric (simplex) vector defining the probability mass function \(\pi_k\) of the time distribution from infection to observation.|
A string representing the family of the sampling distribution \(p(y_t,\phi)\). Can be one of
A string representing the link function used to transform the linear predictor. Can be one of
same as in
The prior distribution for the auxiliary parameter \(\phi\), if it exists. Only used if family is
Additional arguments for
Recall that the
data argument to
epim() must contain all variables used in the transmission, infection and
observational models. For our example,
this looks like
## # A tibble: 899 x 9 ## # Groups: country  ## country date schools_universi… self_isolating_i… public_events lockdown ## <fct> <date> <int> <int> <int> <int> ## 1 Austria 2020-02-22 0 0 0 0 ## 2 Austria 2020-02-23 0 0 0 0 ## 3 Austria 2020-02-24 0 0 0 0 ## 4 Austria 2020-02-25 0 0 0 0 ## 5 Austria 2020-02-26 0 0 0 0 ## 6 Austria 2020-02-27 0 0 0 0 ## 7 Austria 2020-02-28 0 0 0 0 ## 8 Austria 2020-02-29 0 0 0 0 ## 9 Austria 2020-03-01 0 0 0 0 ## 10 Austria 2020-03-02 0 0 0 0 ## # … with 889 more rows, and 3 more variables: ## # social_distancing_encouraged <int>, deaths <int>, pop <int>
columns country and
date define the region and time period corresponding to each of the remaining
epim() assumes that the first seeding day (i.e. the start of the epidemic) in each region
is the first date found in the dataframe. The last date is the final date at which the epidemic is
simulated for the group. It is up to the user to appropriately choose these dates. For our example,
the first and last dates for each group can be seen as follows.
## # A tibble: 11 x 3 ## country start end ## * <fct> <date> <date> ## 1 Austria 2020-02-22 2020-05-05 ## 2 Belgium 2020-02-18 2020-05-05 ## 3 Denmark 2020-02-21 2020-05-05 ## 4 France 2020-02-07 2020-05-05 ## 5 Germany 2020-02-15 2020-05-05 ## 6 Italy 2020-01-27 2020-05-05 ## 7 Norway 2020-02-24 2020-05-05 ## 8 Spain 2020-02-09 2020-05-05 ## 9 Sweden 2020-02-18 2020-05-05 ## 10 Switzerland 2020-02-14 2020-05-05 ## 11 United_Kingdom 2020-02-13 2020-05-05
These start dates are 30 days prior to observing 10 cumulative deaths in each country.
We are ready to fit out first model. This is done by returning to the model
epim(). Our call is as follows. Note that we use
to suppress the algorithm printing output in this article - however this should be disabled
when you run models for yourself as this output is useful.
fm <- epim( rt = rt, inf = inf, obs = deaths, data = data, group_subset = "France", algorithm = "fullrank", tol_rel_obj = 1e-3, seed = 12345, refresh = 0 )
## Warning: Pareto k diagnostic value is 0.76. Resampling is unreliable. Increasing ## the number of draws or decreasing tol_rel_obj may help.
## Registered S3 methods overwritten by 'car': ## method from ## influence.merMod lme4 ## cooks.distance.influence.merMod lme4 ## dfbeta.influence.merMod lme4 ## dfbetas.influence.merMod lme4
plot_obs(fm, type = "deaths")
Flaxman, Seth, Swapnil Mishra, Axel Gandy, H Juliette T Unwin, Thomas A Mellan, Helen Coupland, Charles Whittaker, et al. 2020. “Estimating the effects of non-pharmaceutical interventions on COVID-19 in Europe.” Nature. https://doi.org/10.1038/s41586-020-2405-7.