Data Curation

It is very important to have your data ready and in the right format before you run any models. This section of the tutorial will allow you practice cleaning the data set.

About the dataset

The dataset contains simulated data for 2,000 adults from eight study sites (UK, USA, Colombia, India, South Africa, Poland, Vietnam, and Australia). It was generated to illustrate mixed-effects modeling and data cleaning workflows in respiratory epidemiology. Each participant has demographic, occupational, and lung function information designed to reflect a realistic example.

🧹 Let’s Get Cleaning!

Before jumping into mixed-modelling, we need to tackle some messy data. The dirty dataset (mixed_dirty.txt) mirrors the kind of inconsistencies found in real epidemiological studies, including: typos, missing entries, odd units, and empty spaces.

Variable Description
id Unique participant identifier (1–2000)
site Study site (country: UK, USA, Colombia, India, South Africa, Poland, Vietnam, or Australia)
age Participant age in years (contains missing values)
sex Recorded sex (e.g., “M”, “F”, “male”, “Female”, “f”, “m”, “other”)
height_cm Height in centimetres (may include blanks, “NA”, or blank spaces)
weight_kg Weight in kilograms (some missing or inconsistent entries)
BMI Body mass index derived from height and weight; includes rounding errors and missing values
smoking_status Self-reported smoking behaviour with inconsistent text (e.g., “Never”, “never”, “CURRENT”, “ex”, “quit>1y”, “Unknown”)
pack_years Approximate pack-years of smoking exposure (some missing or inconsistent with smoking status)
education Educational attainment (free-text responses such as “primary”, “sec”, “College”, “uni”, or blanks)
job Reported occupation, intentionally messy (e.g., “nuse”, “mining”, “constructor”, “teacher”, “office”)
exposure_cat Occupational exposure category (“low” or “high_exposure”) with inconsistent naming
FEV1_mL Forced Expiratory Volume in 1 second, in millilitres; includes outliers, missing values, and some negative entries
FVC_mL Forced Vital Capacity, in millilitres; includes missing values

Note: The mixed_dirty.txt dataset includes deliberate inconsistencies and data quality issues for you to identify and correct during cleaning exercises (e.g., converting FEV1/FVC to litres, standardise categorical variables, and deriving exposure categories).

Your task is to explore, clean, and prepare it for analysis.

Install relevant packages and load your data

For this chapter we will need the following R packages: - tidyverse - stringr - dplyr

Note: Only install these if you have not already, otherwise go straight to library()

Load your data

dirty <- read.table("data/mixed_dirty.txt", sep = "\t", header=T)

#Inspect first few rows
head(dirty)
  id    site age    sex height_cm weight_kg  BMI   smoking_status pack_years
1  1      UK  77   male     181.5      97.9 29.7       Occasional       11.0
2  2 Vietnam  22      m     157.2      62.6 25.3          CURRENT       15.2
3  3  Poland  45      M     146.3      48.3 22.6                         0.0
4  4   India  49 Female     149.1      62.2 28.0          current        6.7
5  5   India  72      m     177.8      67.7 21.4          quit<1y        0.0
6  6 Vietnam  26  other     160.4      78.8 30.6 Smokes socially         6.9
  education       job  exposure_cat FEV1_mL FVC_mL
1  tertiary bartender           LOW    3469   4918
2       BSc    porter           LOW    4360   5107
3   primary   factory high_exposure    3154   4898
4 Secondary   factory high_exposure    1876   3283
5           security            hi     3002   5338
6   College    doctor           low    3010   3888

1. Fixing white spaces and NAs

It is common to encounter missing spaces in our data, but it is important to ensure that this is fixed before we start analysing our data. The str_squish() function trims these and replaces multiple spaces with a single one, this is an important first step before matching or grouping text variables (e.g., “nurse” vs “nurse”).

cleaning <- dirty %>%
  mutate(across(where(is.character), ~str_squish(.x)))

2. Convert numeric columns correctly

Sometimes numbers are stored as text (for example, “170” or “NA” as a string). We need to make sure we convert those variables to numeric format.

cleaning <- cleaning %>%
  mutate(across(c(age, height_cm, weight_kg, BMI, pack_years, FEV1_mL, FVC_mL),
                ~suppressWarnings(as.numeric(.x))))

3. Standardise categorical variables

Categorical variables often contain many variations of the same label (e.g., “female”, “FEMALE”, or “F”). Here, we make them consistent using case_when(). The same logic applies to similar variables such as smoking status and education. Standardising is a very important step that ensures data is grouped correctly.

cleaning <- cleaning %>%
  mutate(
    sex_clean = case_when(
      str_starts(tolower(sex), "m") ~ "Male",
      str_starts(tolower(sex), "f") ~ "Female",
      TRUE ~ "Other"
    ),
    smoking_status_clean = case_when(
      is.na(smoking_status) | smoking_status %in% c("", "NA", "Unknown", "unknown") ~ NA_character_,
      str_starts(tolower(smoking_status), "never") ~ "Never",
      str_detect(tolower(smoking_status), "current|smokes socially") & 
        !str_detect(tolower(smoking_status), "quit") ~ "Current",
      str_detect(tolower(smoking_status), "former|ex|quit") ~ "Former",
      tolower(smoking_status) == "occasional" ~ "Current",
      TRUE ~ NA_character_
    ),
    education_clean = case_when(
      is.na(education) | education %in% c("", "NA", "na", "none", "None") ~ "None",
      tolower(education) %in% c("primary") ~ "Primary",
      tolower(education) %in% c("secondary","sec","high school") ~ "Secondary",
      tolower(education) %in% c("college","uni","tertiary","masters","phd","bsc") ~ "Tertiary",
      TRUE ~ "None"
    )
  )

4. Clean job names (for this ReCoDe example)

In epidemiology studies, we often collect self-reported data like occupational data for our research questions. But coding job titles can be messy, (if not using standardised tools) resulting in misspellings (“nuse”), inconsistent forms (“mining” vs “miner”), or abbreviations. By converting everything to lowercase first and then mapping known variants to standard labels, we make sure occupations are consistent and ready to categorise later (e.g., into exposure groups).

cleaning <- cleaning %>%
  mutate(
    job_lc = tolower(job),
    job_clean = case_when(
      job_lc %in% c("nuse","nurse") ~ "nurse",
      job_lc %in% c("miner","mining") ~ "miner",
      job_lc %in% c("constructor","construction") ~ "construction worker",
      job_lc == "office" ~ "office worker",
      job_lc == "teacher" ~ "teacher",
      job_lc == "farmer" ~ "farmer",
      job_lc == "driver" ~ "driver",
      job_lc == "welder" ~ "welder",
      job_lc == "factory" ~ "factory worker",
      job_lc == "carpenter" ~ "carpenter",
      job_lc == "doctor" ~ "doctor",
      job_lc == "security" ~ "security",
      job_lc == "cleaner" ~ "cleaner",
      job_lc == "mechanic" ~ "mechanic",
      job_lc == "electrician" ~ "electrician",
      job_lc == "bartender" ~ "bartender",
      job_lc == "porter" ~ "porter",
      TRUE ~ job_lc
    )
  )

5. Create a new variable

It is very common to derive and create new variables when using large datasets. In this example, we classify jobs into high exposure or low exposure categories based on potential for dust, fumes, or other respiratory hazards. This step prepares the dataset for later regression or mixed-model analyses exploring exposure–lung function relationships.

cleaning <- cleaning %>%
  mutate(
    exposure_cat_clean = case_when(
      job_clean %in% c("miner","welder","factory worker","carpenter","mechanic","electrician") ~ "high_exposure",
      job_clean %in% c("office worker","teacher","doctor","nurse","construction worker", "security","cleaner","bartender","porter","farmer","driver") ~ "low_exposure",
      TRUE ~ "low_exposure"
    )
  )

6. Convert FEV1 and FVC to litres and calculate ratio

FEV1 and FVC are recorded in millilitres in the raw data, so we convert them to litres for analysis. The FEV1/FVC ratio is a key lung function indicator and values below normal suggest obstruction. During collection of this data some errors may have occurred, so it is important we remove any outliers at this stage that could bias our analyses later on.

cleaning <- cleaning %>%
  mutate(
    FEV1_L = FEV1_mL / 1000,
    FVC_L  = FVC_mL / 1000,
    FEV1_FVC = FEV1_L / FVC_L
  ) %>%
  mutate(
    FEV1_L = if_else(!is.na(FEV1_L) & FEV1_L <= 0, NA_real_, FEV1_L),
    FVC_L  = if_else(!is.na(FVC_L) & FVC_L <= 0, NA_real_, FVC_L),
    FEV1_FVC = if_else(!is.na(FEV1_FVC) & (FEV1_FVC <= 0 | FEV1_FVC > 1.2), NA_real_, FEV1_FVC)
  )

7. Convert FEV1 and FVC to litres and calculate ratio

This step applies the NHANES 2014 Hansen reference equations to estimate the Lower Limit of Normal (LLN) for FEV1/FVC, expressed in percent. For each participant:

  • Males use 74.5 − 0.12 × age

  • Females use 80.2 − 0.214 × age

We then compare their observed FEV1/FVC ratio (converted to a percentage) against this LLN. The new variable Below_LLN is coded as following:

  • 1= airway obstruction

  • 0= no airway obstruction

This binary flag is useful for logistic regression or for summarising the prevalence of obstruction across exposure or demographic groups. We will also use it when running different mixed-models changing the outcome variables.

cleaning <- cleaning %>%
  mutate(
    lln_pct_male = 74.5 - 0.12 * age,
    lln_pct_fem  = 80.2 - 0.214 * age,
    lln_pct = case_when(
      sex_clean == "Male"   ~ lln_pct_male,
      sex_clean == "Female" ~ lln_pct_fem,
      TRUE ~ ((74.5 - 0.12 * age) + (80.2 - 0.214 * age)) / 2  # fallback if sex not M/F
    ),
    # Convert ratio to percent for comparison, then flag below LLN
    Below_LLN = as.integer(!is.na(FEV1_FVC) & (FEV1_FVC * 100) < lln_pct)
  ) %>%
  select(-lln_pct_male, -lln_pct_fem)

✨ Clean Data: Ready for Analysis

Once cleaned, the dataset becomes mixed_clean.txt, with harmonised categories, derived lung function metrics, and an approximate lower limit of normal (LLN) for FEV1/FVC. This version can be used for teaching model fitting, random effects, and basic interpretation of respiratory epidemiology results.

library(knitr)
clean <- read.table("data/mixed_clean.txt", header = TRUE, sep = "\t")

kable(head(clean, 10))
id site age sex height_cm weight_kg BMI smoking_status pack_years education job exposure_cat FEV1_mL FVC_mL sex_clean smoking_status_clean education_clean job_lc job_clean exposure_cat_clean FEV1_L FVC_L FEV1_FVC lln_pct Below_LLN
1 UK 77 male 181.5 97.9 29.7 Occasional 11.0 tertiary bartender LOW 3469 4918 Male Current Tertiary bartender bartender low_exposure 3.469 4.918 0.6486312 65.260 0
2 Vietnam 22 m 157.2 62.6 25.3 CURRENT 15.2 BSc porter LOW 4360 5107 Male Current Tertiary porter porter low_exposure 4.360 5.107 0.8974913 71.860 0
3 Poland 45 M 146.3 48.3 22.6 0.0 primary factory high_exposure 3154 4898 Male Primary factory factory worker high_exposure 3.154 4.898 0.6525437 69.100 1
4 India 49 Female 149.1 62.2 28.0 current 6.7 Secondary factory high_exposure 1876 3283 Female Current Secondary factory factory worker high_exposure 1.876 3.283 0.4948763 69.714 1
5 India 72 m 177.8 67.7 21.4 quit<1y 0.0 security hi 3002 5338 Male Former None security security low_exposure 3.002 5.338 0.4858307 65.860 1
6 Vietnam 26 other 160.4 78.8 30.6 Smokes socially 6.9 College doctor low 3010 3888 Other Current Tertiary doctor doctor low_exposure 3.010 3.888 0.8179380 73.008 0
7 UK 47 male 165.2 63.8 23.4 NEVER 0.0 high school nurse low 4029 5274 Male Never Secondary nurse nurse low_exposure 4.029 5.274 0.7071995 68.860 0
8 Poland 28 FEMALE 185.8 100.2 29.0 None 0.0 College welder high_exposure 3644 5218 Female Tertiary welder welder high_exposure 3.644 5.218 0.7069593 74.208 1
9 USA 71 F 184.4 88.1 25.9 quit<1y 13.1 Primary cleaner low 2750 3922 Female Former Primary cleaner cleaner low_exposure 2.750 3.922 0.6687213 65.006 0
10 UK 60 F 172.2 79.0 26.6 NEVER 0.0 college mechanic high_exposure 2226 3583 Female Never Tertiary mechanic mechanic high_exposure 2.226 3.583 0.5645303 67.360 1

Final variables in the Clean Dataset

Variable Description
id Unique participant identifier.
site Study site (country); use as random intercept.
age Age in years (numeric, cleaned).
sex Standardised sex: Male, Female, or Other.
height_cm Height in centimetres (numeric, cleaned).
weight_kg Weight in kilograms (numeric, cleaned).
BMI Body mass index derived from height and weight.
smoking_status Recoded smoking category: Never, Former, Current, Occasional (or NA).
pack_years Numeric smoking exposure (may be 0 for never smokers, NA if unknown).
education Cleaned education level: None, Primary, Secondary, Tertiary.
job Original reported occupation (as entered).
job_clean Standardised occupation label (e.g., miner, office worker).
exposure_cat Original exposure category from the raw file (left for reference).
exposure_cat_clean Clean binary exposure category: low_exposure or high_exposure based on job_clean.
FEV1_mL FEV1 in millilitres (numeric, cleaned).
FVC_mL FVC in millilitres (numeric, cleaned).
FEV1_L FEV1 converted to litres.
FVC_L FVC converted to litres.
FEV1_FVC Ratio FEV1/FVC (unitless).
LLN_FEV1FVC_approx Age/sex–adjusted approximate LLN for FEV1/FVC (NHANES-inspired, for teaching).
Below_LLN Indicator of obstruction: 1 = FEV1/FVC below LLN, 0 = otherwise.