Mixed Models Analysis Using Large Datasets

Author

Valentina Quintero Santofimio

Introduction

Welcome to The Random Side of R: A Mixed Model Adventure ReCoDe Project

This project is designed to guide you through fitting mixed-effects models to large datasets in R. Mixed models (also known as hierarchical, multilevel, or random-effects models) are statistical approaches that allow us to account for both fixed effects (systematic, population-level factors) and random effects (group- or subject-level variability).

These models are particularly powerful when dealing with grouped, clustered, or repeated-measures data, such as individuals measured multiple times, schools within regions, or experiments conducted across sites. By explicitly modelling the correlation structure within the data, mixed models give more trustworthy standard errors for the fixed effects and inferences that generalise beyond the specific groups sampled.

The tutorial is structured into a series of sections introducing you to the basic concepts, analytic approaches, and applications of mixed models in R. You will learn how to structure your data, fit models, interpret both fixed and random effects, and visualise and validate results.

Project Structure

| Stage | Focus | R packages |
|---|---|---|
| 1. Curation | Importing, cleaning, and structuring grouped data | tidyverse |
| 2. Analysis | Fitting and interpreting mixed models | lme4, lmerTest |
| 3. Results | Plotting results from mixed models | sjPlot, ggeffects |
| | Presenting your findings (tables, forest plots) | ggplot2, gt, forestplot |
| 4. Extension task | Generalised or multilevel structures | glmmTMB, brms |
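
Before starting, it can help to check which of the packages listed in the table above are already installed. The snippet below is a minimal sketch using only base R. The variable names `pkgs` and `missing_pkgs` are ours, and the install call is left commented out so that nothing is installed unless you choose to.

```r
# Packages named in the project structure table, grouped by stage
pkgs <- c(
  "tidyverse",                    # 1. Curation
  "lme4", "lmerTest",             # 2. Analysis
  "sjPlot", "ggeffects",          # 3. Results (plots)
  "ggplot2", "gt", "forestplot",  # 3. Results (presentation)
  "glmmTMB", "brms"               # 4. Extension task
)

# Packages from the list that are not found in the current library paths
missing_pkgs <- setdiff(pkgs, rownames(installed.packages()))
print(missing_pkgs)

# Uncomment to install anything that is missing:
# install.packages(missing_pkgs)
```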

Pre-requisites

In this ReCoDe project we will be using R, so it would be useful to take one of the courses offered by the Graduate School at Imperial College London, either as an introduction or as a refresher:

Research Computing & Data Science Skills Courses

Learning outcomes

At the end of this project you should be able to:

  • Understand the foundations of linear mixed-effects models (LMMs)
  • Know when and why to use mixed models over simple linear regression
  • Implement mixed models in R
  • Interpret fixed and random effects, variance components, and model fit statistics
  • Visualise and validate models
  • Extend to generalised mixed models (GLMMs) and crossed/nested designs

What are Mixed Models?


The following lecture gives an overview of what mixed models are. It explains the differences between fixed-effects and random-effects models, as well as the difference between random intercepts and random slopes, and uses clear examples to put the terminology into perspective.

Introductory lecture on mixed models

Video credits: Methods in Experimental Ecology (YouTube channel).

Before we move on to organising our data and modelling, it’s worth pausing to clarify what we mean by mixed effects and why they matter for this type of dataset.

A mixed-effects model includes two types of terms:

  • Fixed effects:
    These are the standard variables you see in a typical regression model: age, sex, BMI, smoking status, exposure category, etc. They are called fixed because the effect we estimate is assumed to be the same across all individuals and all sites. These are usually the variables we actually want to draw conclusions about.

  • Random effects:
    These account for structured dependence in the data, where observations are not fully independent. They typically correspond to a grouping factor such as:

    • Repeated measures per person

    • Students within schools

    • Patients within hospitals

    • Participants nested inside study sites (countries), as in our case

Random effects let us acknowledge that people measured in the same site share something that cannot be explained by our covariates.
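
In symbols, the simplest version of this idea is a random-intercept model with a single covariate. The notation below is ours, chosen for illustration: $i$ indexes participants, $j$ indexes sites, $y_{ij}$ is the outcome, and $x_{ij}$ is a covariate.

$$
y_{ij} = \beta_0 + \beta_1 x_{ij} + u_j + \varepsilon_{ij},
\qquad u_j \sim \mathcal{N}(0, \sigma_u^2),
\qquad \varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)
$$

Here $\beta_0$ and $\beta_1$ are the fixed effects, shared by everyone, while $u_j$ is the random effect: a site-specific shift that every participant at site $j$ has in common. The two variances, $\sigma_u^2$ (between sites) and $\sigma^2$ (within sites), are the variance components of the model.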

Why do we include site as a random effect?

In this example, each participant belongs to one of several study sites (e.g., UK, Colombia, South Africa). Even though everyone is measured with the same lung function test, sites can differ in ways that influence lung function, for example in equipment calibration, technician training, recruitment patterns, environmental pollution, or healthcare access.

None of these are measured in our dataset, yet all of them can systematically shift the mean FEV1/FVC ratio at a site. This means that participants from the same site are more similar to each other than to participants from a different site; in other words, the observations are not independent.
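
To make this concrete, here is a small simulation in base R. It is only a sketch on made-up data, not the project dataset: the column names `site` and `fev1_fvc`, the number of sites, and the standard deviations are all assumptions chosen for illustration. It gives each site an unmeasured shift and then estimates the intraclass correlation (ICC), the share of the total variance that sits between sites.

```r
set.seed(42)  # for reproducibility

n_sites    <- 8   # hypothetical number of study sites
n_per_site <- 50  # hypothetical participants per site

# Each site gets its own unmeasured shift in mean FEV1/FVC (SD = 0.03)
site_shift <- rnorm(n_sites, mean = 0, sd = 0.03)

sim <- data.frame(site = factor(rep(seq_len(n_sites), each = n_per_site)))

# Ratio = overall mean + site shift + individual noise (SD = 0.05)
sim$fev1_fvc <- 0.75 + site_shift[as.integer(sim$site)] +
  rnorm(nrow(sim), mean = 0, sd = 0.05)

# Site means differ even though no covariate explains why
print(round(tapply(sim$fev1_fvc, sim$site, mean), 3))

# One-way ANOVA estimate of the ICC for a balanced design
ms <- anova(lm(fev1_fvc ~ site, data = sim))[["Mean Sq"]]
var_between <- (ms[1] - ms[2]) / n_per_site  # between-site variance
var_within  <- ms[2]                         # within-site (residual) variance
icc <- var_between / (var_between + var_within)
print(round(icc, 2))
```

An ICC clearly above zero means that two participants from the same site resemble each other more than two participants picked from different sites, which is exactly the dependence that ordinary regression ignores.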

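If we ignore this clustering and use ordinary regression, those unmeasured site-level factors can bias our estimates of the fixed effects, such as smoking or exposure category, and the standard errors will typically be too small because the model treats correlated observations as independent.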

A random intercept for site absorbs those differences by allowing each site to have its own baseline level of lung function.
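
As a rough illustration of what this looks like in R, the sketch below fits a random-intercept model with `lmer()` from the lme4 package. The data are simulated and every name in them (`site`, `smoker`, `fev1_fvc`, the `site_` labels) and every effect size is an assumption made for the example, not a value from the project dataset.

```r
library(lme4)  # provides lmer(), fixef(), VarCorr()

set.seed(1)

n_sites    <- 10
n_per_site <- 100
site_names <- paste0("site_", seq_len(n_sites))  # hypothetical site labels

dat <- data.frame(
  site   = factor(rep(site_names, each = n_per_site), levels = site_names),
  smoker = rbinom(n_sites * n_per_site, size = 1, prob = 0.3)
)

# Simulated truth: smoking lowers the ratio by 0.04; sites differ in baseline
site_baseline <- rnorm(n_sites, mean = 0, sd = 0.03)
dat$fev1_fvc <- 0.78 - 0.04 * dat$smoker +
  site_baseline[as.integer(dat$site)] +
  rnorm(nrow(dat), mean = 0, sd = 0.05)

# Fixed effect: smoker. Random effect: one intercept per site.
fit <- lmer(fev1_fvc ~ smoker + (1 | site), data = dat)

print(fixef(fit))      # population-level intercept and smoking effect
print(VarCorr(fit))    # between-site and residual standard deviations
print(coef(fit)$site)  # per-site coefficients: intercept varies, slope does not
```

In the formula, `smoker` is the fixed part and `(1 | site)` asks for a separate intercept for each level of `site`. The last table shows a different baseline for every site but the same smoking effect in every row, which is what a random intercept without a random slope implies.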

Other resources