1 Model Implementation

Here we give a high-level overview of the workflow required for defining and fitting a model with epidemia. The primary model fitting function is epim(). This takes a model description and additional arguments relating to the fitting algorithm, and proceeds to fit the model using a precompiled Stan program. This is similar to the workflow for fitting Bayesian regression models with rstanarm. A key difference, however, is that the models fit by epidemia are generally complex, and are therefore inherently more difficult to specify. We simplify this process by taking a modular approach; models are defined through three distinct parts: transmission, infections and observations. These components of the model are defined with the functions epirt(), epiinf() and epiobs() respectively.

The package contains an example data set EuropeCovid which contains data on daily death counts from Covid-19 in 11 European Countries from February through May 2020, and a set of binary indicators of non-pharmaceutical interventions. This is used as an example throughout.

library(dplyr)
library(epidemia)
library(rstanarm)
data("EuropeCovid")

We begin by describing epim() in more detail, and then proceed to discuss the three modeling functions.

1.1 Model Fitting

epim() is the only model fitting function in epidemia. It has arguments rt, inf, and obs which expect a description of the transmission model, infection model and all observational models respectively. Together, these fully define the joint distribution of data and parameters. Each of these model components are described in terms of variables that are expected to live in a single data frame, data. This data frame must be compatible with the model components, in the sense that it holds all variables defined in these models. For our example, these variables are the following.

data <- EuropeCovid$data
colnames(data)

## [1] "country"                      "date"                        
## [3] "schools_universities"         "self_isolating_if_ill"       
## [5] "public_events"                "lockdown"                    
## [7] "social_distancing_encouraged" "deaths"                      
## [9] "pop"

The data argument is described in more detail in Section 1.5.

In addition to taking a model description and a data frame, epim() has various additional arguments which specify how the model should be fit. If algorithm = "sampling" then the model will be fit using Stan’s adaptive Hamiltonian Monte Carlo sampler (Hoffman and Gelman 2014). This is done internally by calling sampling() from rstan. If instead this is "meanfield" or "fullrank", then Stan’s Variational Bayes algorithms (Kucukelbir et al. 2015, 2017) are employed by calling vb() from rstan. Any unnamed arguments in the call to epim() are passed directly onto the rstan sampling function. epim() returns a fitted model object of class epimodel, which contains posterior samples from the model along with other useful objects.

In general, Hamiltonian Monte Carlo should be used for final inference. Nonetheless, this is often computationally demanding, and Variational Bayes can often be used fruitful for quickly iterating models. All arguments for epim() are described in Table 1.1.

Table 1.1: Formal arguments for the model fitting function `epim()`. The first three arguments listed below define the model to be fitted.
Argument	Description
`rt`	An object of class `epirt`, resulting from a call to `epirt()`. This defines the model for time-varying reproduction numbers $R_t$.
`inf`	An object of class `epiinf`, resulting from a call to `epiinf()`. This entirely defines the model for infections $i_t$.
`obs`	Either an object of class `epiobs`, or a list of such objects. Each of these define a model for an observation vector in `data`, and result from a call to `epiobs()`. Each element of the list defines a model for an observed variable.
`data`	A dataframe with all data required for fitting the model. This includes all observations and covariates specified in the model.
`algorithm`	One of `"sampling"`, `"meanfield"` or `"fullrank"`. This determines the rstan sampling function to use for fitting the model. `"sampling"` corresponds to HMC, while `"meanfield"` and `"fullrank"` are Variational Bayes algorithms.
`group_subset`	If specified, a character vector naming a subset of groups/populations to include in the model.
`prior_PD`	If `TRUE`, parameters are sampled from their prior distributions. This is useful for prior predictive checks. Defaults to `FALSE`.
…	Additional arguments to pass to the rstan function used to fit the model. If `algorithm = "sampling"`, then this function is `sampling()`. Otherwise `vb()` is used.

1.2 Transmission

epirt() defines the model for time-varying reproduction numbers, which was described in Section 1.4 of the model description article. Recall that these are modeled as a transformed linear predictor. epirt() has a formula argument which defines the linear predictor $\eta$, an argument link defining the link function g, and additional arguments to specify priors on parameters making up $\eta$.

A general R formula gives a symbolic description of a model. It takes the form y ~ model, where y is the response and model is a collection of terms separated by the + operator. model fully defines a linear predictor used to predict y. In this case, the “response” being modeled are reproduction numbers which are unobserved. epirt() therefore requires that the left hand side of the formula takes the form R(group, date), where group and date refer to variables representing the modeled populations and dates respectively. The right hand side can consist of fixed effects, random effects, and autocorrelation terms. For our example, a viable call to epirt() is the following.

rt <- epirt(formula = R(country, date) ~ 1 + lockdown + public_events,
            link = scaled_logit(7))

Here, two fixed effects are included which represent the effects of implementing lockdown and banning public events. These effects are assumed constant across countries. They could alternatively be partially pooled by using the term (lockdown + public_events | country). For information on how to interpret such terms, please see partial pooling. Using link = scaled_logit(7) lets the link function be the scaled logit link described by Equation (1.7) in the model description article, where $K = 7$ is the maximum possible value for reproduction numbers. For simplicity, we have omitted any prior arguments, however these should generally be specified explicitly. Please see prior for detailed information on how to use priors. All arguments for epirt() are listed in Table 1.1.

Table 1.2: Formal arguments for `epirt()`, which defines the model for $R_t$.
Argument	Description
`formula`	An object of class `formula` which determines the linear predictor $\eta$ for $R$. The left hand side must take the form `R(group, date)`, where `group` must be a factor vector indicating group membership (i.e. country, state, age cohort), and `date` must be a vector of class `Date`. This is syntactic sugar for the reproduction number in the given group at the give date.
`link`	The link function $g$. Can be `"log"`, `"identity"` or a call to `scaled_logit()`. Defaults to `"log"`.
`center`	If `TRUE`, covariates specified in `formula` are centered to have mean zero. All priors should then be interpreted as priors on the centered covariates.
`prior`	Same as in `stan_glm()` from rstanarm. Defines the prior on fixed effects $\beta$. Priors provided by rstanarm can be used, and additionally `shifted_gamma`. Note: if `autoscale = TRUE` in the call to the prior function, then automatic rescaling takes place.
`prior_intercept`	Same as in `stan_glm()` from rstanarm. Prior for the regression intercept $\beta_0$ (if it exists).
`prior\_covariance`	Same as in `stan_glmer()` from rstanarm. Defines the prior on the covariance matrix $\Sigma$. Only use if the `formula` has one or more terms of the form `(x \| y)`, in which case there are parameters to partially pool, i.e. $b$ has positive length.
`...`	Additional arguments to pass to `model.frame()` from stats.

1.3 Infections

The infection model is represented by epiinf(). In the most basic version, this defines the distribution of the generation time of the disease, the number of days for which to seed infections, and the prior distribution on seeded infections. These three parameters are controlled by the arguments gen, seed_days and prior_seeds respectively. A possible model is the following.

inf <- epiinf(gen = EuropeCovid$si, seed_days = 6L,
              prior_seeds = hexp(exponential(0.02)))

EuropeCovid$si is a numeric vector representing the distribution for the serial interval of Covid-19. There is an implicit assumption that the generation time can be approximated well by the serial interval. Seeds are modeled hierarchically, and are described by (1.4) and (1.5) in the model description. $\tau$ has been assigned an exponential prior with a mean of 50. Seeded infections are assumed to occur over a period of 6 days.

epiinf() has additional arguments that allow the user to extend the basic model. Using latent = TRUE replaces the renewal equation with Equation (1.8). Daily infections are then treated as latent parameters that are sampled along with other parameters. The family argument specifies the distribution $p(i'_t, d)$, while prior_aux defines the prior on the coefficient of dispersion $d$.

Recall from the model description that the infection process may be modified to explicitly account for changes in infection rates as the remaining susceptible population is depleted. In particular, the adjustment ensures that cumulative infections never breaches the population size. It can be employed by setting pop_adjust = TRUE and using the pop argument to point towards a static variable in the data frame giving the population size. All argument to epiinf() are described in Table 1.3.

Table 1.3: Formal arguments for `epiinf()`, which defines the infection model.
Argument	Description
`gen`	A numeric vector giving the probability mass function $g_k$ for the generation time of the disease (must be a probability vector).
`seed_days`	An integer giving the number of days $v + 1$ for which to seed infections. Defaults to 6L.
`prior_seeds`	Prior distribution on the seed parameter $i$. Defaults to `hexp(prior_aux = rstanarm::exponential(0.03))`.
`latent`	If `TRUE`, treat infections as latent parameters.
`family`	Specifies the family for the infection distribution $p(i'_t, d)$. Only used if `latent = TRUE`, and defaults to `"normal"`.
`prior_aux`	Prior on the auxiliary variable $d$ of $p(i'_t,d)$. This is either the variance-to-mean ratio or the coefficient of variation, depending on the value of `fixed_vtm`. Only used if `latent = TRUE`.
`fixed_vtm`	If `TRUE`, then $p(i'_t, d)$ has a fixed variance-to-mean ratio, i.e. variance is $\sigma^2 = d i'_t$; In this case, $d$ refers to the variance-to-mean ratio. Id `FALSE` then instead standard deviation is assumed proportional to the mean, in which case $d$ is the coefficient of variation. Only used if `latent = TRUE`.
`pop_adjust`	If `TRUE`, applies the population adjustment to the infection process.
`pops`	A character vector giving the population variable. Only used if `pop_adjust = TRUE`.
`prior_susc`	Prior on $S_{v-1} / P$, the initial susceptible population as a proportion of the population size. If `NULL`, this is assumed to be equal to 1 (i.e. everyone is initially susceptible). Otherwise, can be a call to `normal()` from rstanarm, which assigns a normal prior truncated to $[0,1]$. Only used if `pop_adjust = TRUE`.
`rm`	A character vector giving the variable corresponding to $v_t$, i.e. the proportion of $S_t$ to remove at time $t$. Only used if `pop_adjust = TRUE`.
`prior_rm_noise`	Prior on the parameter $\xi$, which controls noise around $v_t$. If `NULL`, no noise is added. Only used if `pop_adjust = TRUE`.

1.4 Observations

An observational model is defined by a call to epiobs(). In particular, this must also make explicit the model for the multipliers $\alpha_t$, and must also specify the coefficients $\pi_k$. epiobs() has a formula argument. The left hand side must indicate the observation vector to be modeled, while the right hand side defines a linear predictor for $\alpha_t$. The argument i2o plays a similar role to the gen argument in epiinf(), however it instead corresponds the vector $\pi$ in Equation (1.2).

Take for example the task of modeling daily deaths, which as we saw is a variable in data. A possible model is the following.

deaths <- epiobs(formula = deaths ~ 1, i2o = EuropeCovid$inf2death,
                 link = scaled_logit(0.02))

Here $\alpha_t$ corresponds to the infection fatality rate (IFR), and is modeled as an intercept transformed by the scaled-logit link. This implies that the IFR is constant over time and its value lies somewhere between 0% and 2%. If the prior on the intercept (specified by the prior_intercept argument) is chosen to be symmetric around zero, then the prior mean for the IFR is 1%. EuropeCovid$inf2death is a numeric simplex vector that gives the same delay distribution as used in Flaxman et al. (2020). This is a density function for a discretized mixture of Gamma random variables.

Additional arguments include family, which specifies the sampling distribution $p(y_t, \phi)$. There are also arguments allowing the user to control prior distributions for effects in the linear predictor for $\alpha_t$, and the prior on the auxiliary variable $\phi$. All arguments to epiobs() are shown in Table 1.4.

Table 1.4: Formal arguments for `epiobs()`. This defines a single observation model. Multiple such models can be used and passed to `epim()` in a list.
Argument	Description
`formula`	An object of class `formula` which determines the linear predictor for the ascertainment rate. The left hand side must define the response that is being modeled (i.e. the actual observations, not the latent ascertainments)
`i2o`	A numeric (probability) vector defining the probability mass function $\pi_k$ of the time from an infection to an observation.
`family`	A string representing the family of the sampling distribution $p(y_t,\phi)$. Can be one of `"poisson"`, `"neg_binom"`, `"quasi_poisson"`, `"normal"` or `"log_normal"`.
`link`	A string representing the link function used to transform the linear predictor. Can be one of `"logit"`, `"probit"`, `"cauchit"`, `"cloglog"`, `"identity"`. Defaults to `"logit"`.
`center`, `prior`, `prior_intercept`	same as in `epirt()`, described above.
`prior_aux`	The prior distribution for the auxiliary parameter $\phi$, if it exists. Only used if family is `"neg_binom"` (reciprocal dispersion), `"quasi_poisson"` (dispersion), `"normal"` (standard deviation) or `"log_normal"` (sigma parameter).
`...`	Additional arguments for `model.frame()` from stats.

1.5 Data

Before fitting our first model in Section 1.6, we elaborate on the data argument to epim(). Recall that this must contain all variables used in the transmission and infection models, and in all observational models. For our example, data looks like

head(data)

## # A tibble: 6 x 9
## # Groups:   country [1]
##   country date       schools_universit… self_isolating_i… public_events lockdown
##   <fct>   <date>                  <int>             <int>         <int>    <int>
## 1 Austria 2020-02-22                  0                 0             0        0
## 2 Austria 2020-02-23                  0                 0             0        0
## 3 Austria 2020-02-24                  0                 0             0        0
## 4 Austria 2020-02-25                  0                 0             0        0
## 5 Austria 2020-02-26                  0                 0             0        0
## 6 Austria 2020-02-27                  0                 0             0        0
## # … with 3 more variables: social_distancing_encouraged <int>, deaths <int>,
## #   pop <int>

The columns country and date define the region and time period corresponding to each of the remaining variables. epim() assumes that the first seeding day (i.e. the start of the epidemic) in each region is the first date found in the data frame. The last data found for each region is the final data at which the epidemic is simulated. It is up to the user to appropriately choose these dates. For our example, the first and last dates for each group can be seen as follows.

dates <- summarise(data, start = min(date), end = max(date))
head(dates)

## # A tibble: 6 x 3
##   country start      end       
##   <fct>   <date>     <date>    
## 1 Austria 2020-02-22 2020-05-05
## 2 Belgium 2020-02-18 2020-05-05
## 3 Denmark 2020-02-21 2020-05-05
## 4 France  2020-02-07 2020-05-05
## 5 Germany 2020-02-15 2020-05-05
## 6 Italy   2020-01-27 2020-05-05

Here, the start dates have been heuristically chosen to be 30 days prior to observing 10 cumulative deaths in each country.

1.6 A First Fit

We are now ready to fit our first model. For this we return to the model fitting function epim(). The following command is used to instruct epidemia to run Markov chains in parallel, rather than sequentially, if multiple cores are detected.

 options(mc.cores = parallel::detectCores())

Our call to epim() is as follows. We use refresh = 0 to suppress printing output in this article, however, this should not generally be used as such output is useful.

fm <- epim(rt = rt, inf = inf, obs = deaths, data = data,
           group_subset = "France", algorithm = "sampling", iter = 1e3,
           seed = 12345, refresh = 0)

The print method for epimodel objects prints summary statistics for model parameters. These are obtained from the sampled posterior distribution. Parameter are displayed according to which part of the model they belong to (transmission, observations, infections). An estimate of the standard deviation, labeled MAD_SD is displayed. This is the median absolute deviation from the median, and is more robust than naive estimates of the standard deviation for long-tailed distributions.

print(fm)

## 
## Rt regression parameters:
## ==========
## coefficients:
##                 Median MAD_SD
## R|(Intercept)    0.7    0.2  
## R|lockdown      -2.4    0.3  
## R|public_events -0.4    0.3  
## 
##  deaths  regression parameters:
## ==========
## coefficients:
##                              Median MAD_SD
## deaths|(Intercept)            0.0    0.2  
## deaths|reciprocal dispersion 10.4    0.4  
## 
## Infection model parameters:
## ==========
##               Median MAD_SD
## seeds[France] 15.1    5.2  
## seeds_aux     27.2   22.0

Alternatively, the summary method can be used. This gives quantiles of the posterior draws, and also displays some MCMC diagnostics.

summary(fm)

## 
## 
## Estimates:
##                                mean   sd   10%   50%   90%
## R|(Intercept)                 0.7    0.2  0.5   0.7   0.9 
## R|lockdown                   -2.4    0.3 -2.8  -2.4  -2.1 
## R|public_events              -0.4    0.3 -0.8  -0.4   0.0 
## deaths|(Intercept)            0.0    0.2 -0.3   0.0   0.2 
## seeds[France]                16.0    5.8  9.4  15.1  24.1 
## seeds_aux                    38.2   35.3  8.5  27.2  80.7 
## deaths|reciprocal dispersion 10.5    0.5 10.1  10.4  11.1 
## 
## MCMC diagnostics
##                              mcse Rhat n_eff
## R|(Intercept)                0.0  1.0  1221 
## R|lockdown                   0.0  1.0  1168 
## R|public_events              0.0  1.0   967 
## deaths|(Intercept)           0.0  1.0  1649 
## seeds[France]                0.2  1.0  1362 
## seeds_aux                    1.1  1.0  1113 
## deaths|reciprocal dispersion 0.0  1.0  2210 
## log-posterior                0.1  1.0   708

2 References

Flaxman, Seth, Swapnil Mishra, Axel Gandy, H Juliette T Unwin, Thomas A Mellan, Helen Coupland, Charles Whittaker, et al. 2020. “Estimating the effects of non-pharmaceutical interventions on COVID-19 in Europe.” Nature. https://doi.org/10.1038/s41586-020-2405-7.

Hoffman, Matthew D., and Andrew Gelman. 2014. “The no-U-turn sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo.” Journal of Machine Learning Research 15.

Kucukelbir, Alp, David M. Blei, Andrew Gelman, Rajesh Ranganath, and Dustin Tran. 2017. “Automatic Differentiation Variational Inference.” Journal of Machine Learning Research 18.

Kucukelbir, Alp, Rajesh Ranganath, Andrew Gelman, and David M. Blei. 2015. “Automatic variational inference in Stan.” In Advances in Neural Information Processing Systems. Vols. 2015-January.