Mixed Models Analysis Using Large Datasets
Introduction
Welcome to The Random Side of R: A Mixed Model Adventure ReCoDe Project
This project guides you through fitting mixed-effects models to large datasets in R. Mixed models (also known as hierarchical, multilevel, or random-effects models) are statistical approaches that account for both fixed effects (systematic, population-level factors) and random effects (group- or subject-level variability).
These models are particularly powerful when dealing with grouped, clustered, or repeated-measures data, such as individuals measured multiple times, schools within regions, or experiments conducted across sites. By explicitly modelling the correlation within groups, mixed models give more trustworthy standard errors and inferences that generalise beyond the groups sampled.
The tutorial is structured into a series of sections introducing you to the basic concepts, analytic approaches, and applications of mixed models in R. You will learn how to structure your data, fit models, interpret both fixed and random effects, and visualise and validate results.
Project Structure
| Stage | Focus | R packages |
|---|---|---|
| 1. Curation | Importing, cleaning, and structuring grouped data | tidyverse |
| 2. Analysis | Fitting and interpreting mixed models | lme4, lmerTest |
| 3. Results | Plotting results from mixed models | sjPlot, ggeffects |
|  | Presenting your findings (tables, forest plots) | ggplot2, gt, forestplot |
| 4. Extension task | Generalised or multilevel structures | glmmTMB, brms |
Pre-requisites
This ReCoDe project uses R throughout, so you may find it useful to take one of the courses offered by the Graduate School at Imperial College London, either as an introduction or as a refresher:
Research Computing & Data Science Skills Courses
Learning outcomes
At the end of this project you should be able to:
- Understand the foundation of linear mixed-effects models (LMMs)
- Know when and why to use mixed models over simple linear regression
- Implement mixed models in R
- Interpret fixed and random effects, variance components, and model fit statistics
- Visualise and validate models
- Extend to generalised linear mixed models (GLMMs) and crossed/nested designs
What are Mixed Models?
The following lecture gives an overview of what mixed models are. It explains the differences between fixed-effects and random-effects models, and between random intercepts and random slopes, with examples that put the terminology into perspective.
Introductory lecture on mixed models
Video credits: Methods in Experimental Ecology (YouTube channel).
Before we move on to organising our data and modelling, it’s worth pausing to clarify what we mean by mixed effects and why they matter for this type of dataset.
A mixed effects model includes two types of predictors:
Fixed effects:
These are the standard variables you see in a typical regression model: age, sex, BMI, smoking status, exposure category, etc. They are called fixed because the effect we estimate is assumed to be the same across all individuals and all sites. These are usually the variables we actually care about drawing conclusions from.
Random effects:
These account for structured dependence in the data, where observations are not fully independent. They typically correspond to a grouping factor such as:
- Repeated measures per person
- Students within schools
- Patients within hospitals
- Participants nested inside study sites (countries), as in our case
Random effects let us acknowledge that people measured in the same site share something that cannot be explained by our covariates.
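In R, the lme4 package expresses this split directly in the model formula: fixed effects are written as ordinary terms, and a random intercept for a grouping factor is written as `(1 | group)`. The sketch below only builds and inspects such a formula with base R. The column names (`fev1_fvc`, `age`, `sex`, `smoking`, `site`) are hypothetical placeholders, not the names used in the project dataset.

```r
# Hypothetical column names, for illustration only.
# Fixed effects: age, sex, smoking. Random intercept: site.
model_formula <- fev1_fvc ~ age + sex + smoking + (1 | site)

# A formula is an unevaluated expression, so no data or extra packages
# are needed to create or inspect it.
vars <- all.vars(model_formula)
print(vars)
```

Passing a formula of this shape to `lme4::lmer()` together with a data frame is what actually fits the model.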
Why do we include site as a random effect?
In this example, each participant belongs to one of several study sites (e.g., UK, Colombia, South Africa). Even though everyone is measured with the same lung function test, sites can differ in ways that influence lung function, for example in equipment calibration, technician training, recruitment patterns, environmental pollution, or healthcare access.
None of these are measured in our dataset, yet all of them can systematically shift the mean FEV1/FVC ratio at a site. This means that participants from the same site are more similar to each other than to participants from a different site, so the observations are not independent.
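To make "more similar to each other" concrete, the sketch below simulates data in which each site has its own unmeasured baseline shift, then estimates the intraclass correlation (ICC), the share of total variance that lies between sites. All numbers (8 sites, 50 participants each, the standard deviations) are made-up assumptions for illustration, not values from the project dataset. The ICC is estimated with the classical one-way ANOVA estimator, using base R only.

```r
set.seed(42)

n_sites <- 8       # assumed number of sites
n_per_site <- 50   # assumed participants per site (balanced design)

site <- factor(rep(paste0("site_", seq_len(n_sites)), each = n_per_site))

# Unmeasured site-level differences: one baseline shift per site
site_shift <- rnorm(n_sites, mean = 0, sd = 0.03)

# Simulated FEV1/FVC ratio = overall mean + site shift + individual noise
ratio <- 0.75 + site_shift[as.integer(site)] +
  rnorm(n_sites * n_per_site, mean = 0, sd = 0.05)

d <- data.frame(site = site, ratio = ratio)

# One-way ANOVA: split variation into between-site and within-site parts
aov_tab <- anova(lm(ratio ~ site, data = d))
ms_between <- aov_tab["site", "Mean Sq"]
ms_within <- aov_tab["Residuals", "Mean Sq"]

# Method-of-moments estimate of the between-site variance (truncated at 0)
var_between <- max((ms_between - ms_within) / n_per_site, 0)

# ICC: proportion of total variance attributable to site
icc <- var_between / (var_between + ms_within)
print(icc)
```

With these simulation settings the true ICC is 0.03² / (0.03² + 0.05²) ≈ 0.26, and the estimate will vary around that value because there are only 8 sites. An ICC clearly above zero is the signal that observations within a site are correlated.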
If we ignore this clustering and use ordinary regression, those unmeasured site-level factors can distort our estimates of the fixed effects, such as smoking or exposure category, and make their standard errors too small.
A random intercept for site absorbs those differences by allowing each site to have its own baseline level of lung function.
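As a minimal sketch of what this looks like in practice, the code below simulates a small dataset and fits both an ordinary regression and a random-intercept model with `lme4::lmer()`. The data, the variable names (`ratio`, `smoker`, `site`), and the effect sizes are all invented for illustration and are not the project dataset. The sketch assumes the lme4 package is installed.

```r
library(lme4)  # assumed to be installed

set.seed(1)

n_sites <- 8
n_per_site <- 50
n <- n_sites * n_per_site

site <- factor(rep(paste0("site_", seq_len(n_sites)), each = n_per_site))
smoker <- rbinom(n, size = 1, prob = 0.3)

# Each site gets its own baseline shift (the random intercept u_j)
site_shift <- rnorm(n_sites, mean = 0, sd = 0.03)

# Data-generating model:
#   ratio_ij = beta0 + beta1 * smoker_ij + u_j + e_ij
ratio <- 0.78 - 0.04 * smoker + site_shift[as.integer(site)] +
  rnorm(n, mean = 0, sd = 0.05)

d <- data.frame(site = site, smoker = smoker, ratio = ratio)

# Ordinary regression: ignores site entirely
fit_lm <- lm(ratio ~ smoker, data = d)

# Mixed model: same fixed effect, plus a random intercept for site
fit_lmm <- lmer(ratio ~ smoker + (1 | site), data = d)

fixef(fit_lmm)        # population-level intercept and smoking effect
VarCorr(fit_lmm)      # between-site and residual standard deviations
ranef(fit_lmm)$site   # estimated baseline shift for each site
```

Comparing `summary(fit_lm)` with `summary(fit_lmm)` shows how the two models differ in their estimates and standard errors for `smoker`. The per-site values from `ranef()` are the estimated baselines described above.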