It is very important to have your data ready and in the right format before you run any models. This section of the tutorial will allow you practice cleaning the data set.
About the dataset
The dataset contains simulated data for 2,000 adults from eight study sites (UK, USA, Colombia, India, South Africa, Poland, Vietnam, and Australia). It was generated to illustrate mixed-effects modeling and data cleaning workflows in respiratory epidemiology. Each participant has demographic, occupational, and lung function information designed to reflect a realistic example.
🧹 Let’s Get Cleaning!
Before jumping into mixed-modelling, we need to tackle some messy data. The dirty dataset (mixed_dirty.txt) mirrors the kind of inconsistencies found in real epidemiological studies, including: typos, missing entries, odd units, and empty spaces.
Variable
Description
id
Unique participant identifier (1–2000)
site
Study site (country: UK, USA, Colombia, India, South Africa, Poland, Vietnam, or Australia)
age
Participant age in years (contains missing values)
sex
Recorded sex (e.g., “M”, “F”, “male”, “Female”, “f”, “m”, “other”)
height_cm
Height in centimetres (may include blanks, “NA”, or blank spaces)
weight_kg
Weight in kilograms (some missing or inconsistent entries)
BMI
Body mass index derived from height and weight; includes rounding errors and missing values
smoking_status
Self-reported smoking behaviour with inconsistent text (e.g., “Never”, “never”, “CURRENT”, “ex”, “quit>1y”, “Unknown”)
pack_years
Approximate pack-years of smoking exposure (some missing or inconsistent with smoking status)
education
Educational attainment (free-text responses such as “primary”, “sec”, “College”, “uni”, or blanks)
Occupational exposure category (“low” or “high_exposure”) with inconsistent naming
FEV1_mL
Forced Expiratory Volume in 1 second, in millilitres; includes outliers, missing values, and some negative entries
FVC_mL
Forced Vital Capacity, in millilitres; includes missing values
Note: The mixed_dirty.txt dataset includes deliberate inconsistencies and data quality issues for you to identify and correct during cleaning exercises (e.g., converting FEV1/FVC to litres, standardise categorical variables, and deriving exposure categories).
Your task is to explore, clean, and prepare it for analysis.
Install relevant packages and load your data
For this chapter we will need the following R packages: - tidyverse - stringr - dplyr
Note: Only install these if you have not already, otherwise go straight to library()
Load your data
dirty <-read.table("data/mixed_dirty.txt", sep ="\t", header=T)#Inspect first few rowshead(dirty)
id site age sex height_cm weight_kg BMI smoking_status pack_years
1 1 UK 77 male 181.5 97.9 29.7 Occasional 11.0
2 2 Vietnam 22 m 157.2 62.6 25.3 CURRENT 15.2
3 3 Poland 45 M 146.3 48.3 22.6 0.0
4 4 India 49 Female 149.1 62.2 28.0 current 6.7
5 5 India 72 m 177.8 67.7 21.4 quit<1y 0.0
6 6 Vietnam 26 other 160.4 78.8 30.6 Smokes socially 6.9
education job exposure_cat FEV1_mL FVC_mL
1 tertiary bartender LOW 3469 4918
2 BSc porter LOW 4360 5107
3 primary factory high_exposure 3154 4898
4 Secondary factory high_exposure 1876 3283
5 security hi 3002 5338
6 College doctor low 3010 3888
1. Fixing white spaces and NAs
It is common to encounter missing spaces in our data, but it is important to ensure that this is fixed before we start analysing our data. The str_squish() function trims these and replaces multiple spaces with a single one, this is an important first step before matching or grouping text variables (e.g., “nurse” vs “nurse”).
Categorical variables often contain many variations of the same label (e.g., “female”, “FEMALE”, or “F”). Here, we make them consistent using case_when(). The same logic applies to similar variables such as smoking status and education. Standardising is a very important step that ensures data is grouped correctly.
In epidemiology studies, we often collect self-reported data like occupational data for our research questions. But coding job titles can be messy, (if not using standardised tools) resulting in misspellings (“nuse”), inconsistent forms (“mining” vs “miner”), or abbreviations. By converting everything to lowercase first and then mapping known variants to standard labels, we make sure occupations are consistent and ready to categorise later (e.g., into exposure groups).
It is very common to derive and create new variables when using large datasets. In this example, we classify jobs into high exposure or low exposure categories based on potential for dust, fumes, or other respiratory hazards. This step prepares the dataset for later regression or mixed-model analyses exploring exposure–lung function relationships.
6. Convert FEV1 and FVC to litres and calculate ratio
FEV1 and FVC are recorded in millilitres in the raw data, so we convert them to litres for analysis. The FEV1/FVC ratio is a key lung function indicator and values below normal suggest obstruction. During collection of this data some errors may have occurred, so it is important we remove any outliers at this stage that could bias our analyses later on.
7. Convert FEV1 and FVC to litres and calculate ratio
This step applies the NHANES 2014 Hansen reference equations to estimate the Lower Limit of Normal (LLN) for FEV1/FVC, expressed in percent. For each participant:
Males use 74.5 − 0.12 × age
Females use 80.2 − 0.214 × age
We then compare their observed FEV1/FVC ratio (converted to a percentage) against this LLN. The new variable Below_LLN is coded as following:
1= airway obstruction
0= no airway obstruction
This binary flag is useful for logistic regression or for summarising the prevalence of obstruction across exposure or demographic groups. We will also use it when running different mixed-models changing the outcome variables.
cleaning <- cleaning %>%mutate(lln_pct_male =74.5-0.12* age,lln_pct_fem =80.2-0.214* age,lln_pct =case_when( sex_clean =="Male"~ lln_pct_male, sex_clean =="Female"~ lln_pct_fem,TRUE~ ((74.5-0.12* age) + (80.2-0.214* age)) /2# fallback if sex not M/F ),# Convert ratio to percent for comparison, then flag below LLNBelow_LLN =as.integer(!is.na(FEV1_FVC) & (FEV1_FVC *100) < lln_pct) ) %>%select(-lln_pct_male, -lln_pct_fem)
✨ Clean Data: Ready for Analysis
Once cleaned, the dataset becomes mixed_clean.txt, with harmonised categories, derived lung function metrics, and an approximate lower limit of normal (LLN) for FEV1/FVC. This version can be used for teaching model fitting, random effects, and basic interpretation of respiratory epidemiology results.