Data Curation

It is very important to have your data ready and in the right format before you run any models. This section of the tutorial will allow you practice cleaning the data set.

About the dataset

The dataset contains simulated data for 2,000 adults from eight study sites (UK, USA, Colombia, India, South Africa, Poland, Vietnam, and Australia). It was generated to illustrate mixed-effects modeling and data cleaning workflows in respiratory epidemiology. Each participant has demographic, occupational, and lung function information designed to reflect a realistic example.

🧹 Let’s Get Cleaning!

Before jumping into mixed-modelling, we need to tackle some messy data. The dirty dataset (mixed_dirty.txt) mirrors the kind of inconsistencies found in real epidemiological studies, including: typos, missing entries, odd units, and empty spaces.

Variable	Description
id	Unique participant identifier (1–2000)
site	Study site (country: UK, USA, Colombia, India, South Africa, Poland, Vietnam, or Australia)
age	Participant age in years (contains missing values)
sex	Recorded sex (e.g., “M”, “F”, “male”, “Female”, “f”, “m”, “other”)
height_cm	Height in centimetres (may include blanks, “NA”, or blank spaces)
weight_kg	Weight in kilograms (some missing or inconsistent entries)
BMI	Body mass index derived from height and weight; includes rounding errors and missing values
smoking_status	Self-reported smoking behaviour with inconsistent text (e.g., “Never”, “never”, “CURRENT”, “ex”, “quit>1y”, “Unknown”)
pack_years	Approximate pack-years of smoking exposure (some missing or inconsistent with smoking status)
education	Educational attainment (free-text responses such as “primary”, “sec”, “College”, “uni”, or blanks)
job	Reported occupation, intentionally messy (e.g., “nuse”, “mining”, “constructor”, “teacher”, “office”)
exposure_cat	Occupational exposure category (“low” or “high_exposure”) with inconsistent naming
FEV1_mL	Forced Expiratory Volume in 1 second, in millilitres; includes outliers, missing values, and some negative entries
FVC_mL	Forced Vital Capacity, in millilitres; includes missing values

Note: The mixed_dirty.txt dataset includes deliberate inconsistencies and data quality issues for you to identify and correct during cleaning exercises (e.g., converting FEV1/FVC to litres, standardise categorical variables, and deriving exposure categories).

Your task is to explore, clean, and prepare it for analysis.

Install relevant packages and load your data

For this chapter we will need the following R packages: - tidyverse - stringr - dplyr

Note: Only install these if you have not already, otherwise go straight to library()

Load your data

dirty <- read.table("data/mixed_dirty.txt", sep = "\t", header=T)

#Inspect first few rows
head(dirty)

  id    site age    sex height_cm weight_kg  BMI   smoking_status pack_years
1  1      UK  77   male     181.5      97.9 29.7       Occasional       11.0
2  2 Vietnam  22      m     157.2      62.6 25.3          CURRENT       15.2
3  3  Poland  45      M     146.3      48.3 22.6                         0.0
4  4   India  49 Female     149.1      62.2 28.0          current        6.7
5  5   India  72      m     177.8      67.7 21.4          quit<1y        0.0
6  6 Vietnam  26  other     160.4      78.8 30.6 Smokes socially         6.9
  education       job  exposure_cat FEV1_mL FVC_mL
1  tertiary bartender           LOW    3469   4918
2       BSc    porter           LOW    4360   5107
3   primary   factory high_exposure    3154   4898
4 Secondary   factory high_exposure    1876   3283
5           security            hi     3002   5338
6   College    doctor           low    3010   3888

1. Fixing white spaces and NAs

It is common to encounter missing spaces in our data, but it is important to ensure that this is fixed before we start analysing our data. The str_squish() function trims these and replaces multiple spaces with a single one, this is an important first step before matching or grouping text variables (e.g., “nurse” vs “nurse”).

cleaning <- dirty %>%
  mutate(across(where(is.character), ~str_squish(.x)))

2. Convert numeric columns correctly

Sometimes numbers are stored as text (for example, “170” or “NA” as a string). We need to make sure we convert those variables to numeric format.

cleaning <- cleaning %>%
  mutate(across(c(age, height_cm, weight_kg, BMI, pack_years, FEV1_mL, FVC_mL),
                ~suppressWarnings(as.numeric(.x))))

3. Standardise categorical variables

Categorical variables often contain many variations of the same label (e.g., “female”, “FEMALE”, or “F”). Here, we make them consistent using case_when(). The same logic applies to similar variables such as smoking status and education. Standardising is a very important step that ensures data is grouped correctly.

cleaning <- cleaning %>%
  mutate(
    sex_clean = case_when(
      str_starts(tolower(sex), "m") ~ "Male",
      str_starts(tolower(sex), "f") ~ "Female",
      TRUE ~ "Other"
    ),
    smoking_status_clean = case_when(
      is.na(smoking_status) | smoking_status %in% c("", "NA", "Unknown", "unknown") ~ NA_character_,
      str_starts(tolower(smoking_status), "never") ~ "Never",
      str_detect(tolower(smoking_status), "current|smokes socially") & 
        !str_detect(tolower(smoking_status), "quit") ~ "Current",
      str_detect(tolower(smoking_status), "former|ex|quit") ~ "Former",
      tolower(smoking_status) == "occasional" ~ "Current",
      TRUE ~ NA_character_
    ),
    education_clean = case_when(
      is.na(education) | education %in% c("", "NA", "na", "none", "None") ~ "None",
      tolower(education) %in% c("primary") ~ "Primary",
      tolower(education) %in% c("secondary","sec","high school") ~ "Secondary",
      tolower(education) %in% c("college","uni","tertiary","masters","phd","bsc") ~ "Tertiary",
      TRUE ~ "None"
    )
  )

4. Clean job names (for this ReCoDe example)

In epidemiology studies, we often collect self-reported data like occupational data for our research questions. But coding job titles can be messy, (if not using standardised tools) resulting in misspellings (“nuse”), inconsistent forms (“mining” vs “miner”), or abbreviations. By converting everything to lowercase first and then mapping known variants to standard labels, we make sure occupations are consistent and ready to categorise later (e.g., into exposure groups).

cleaning <- cleaning %>%
  mutate(
    job_lc = tolower(job),
    job_clean = case_when(
      job_lc %in% c("nuse","nurse") ~ "nurse",
      job_lc %in% c("miner","mining") ~ "miner",
      job_lc %in% c("constructor","construction") ~ "construction worker",
      job_lc == "office" ~ "office worker",
      job_lc == "teacher" ~ "teacher",
      job_lc == "farmer" ~ "farmer",
      job_lc == "driver" ~ "driver",
      job_lc == "welder" ~ "welder",
      job_lc == "factory" ~ "factory worker",
      job_lc == "carpenter" ~ "carpenter",
      job_lc == "doctor" ~ "doctor",
      job_lc == "security" ~ "security",
      job_lc == "cleaner" ~ "cleaner",
      job_lc == "mechanic" ~ "mechanic",
      job_lc == "electrician" ~ "electrician",
      job_lc == "bartender" ~ "bartender",
      job_lc == "porter" ~ "porter",
      TRUE ~ job_lc
    )
  )

5. Create a new variable

It is very common to derive and create new variables when using large datasets. In this example, we classify jobs into high exposure or low exposure categories based on potential for dust, fumes, or other respiratory hazards. This step prepares the dataset for later regression or mixed-model analyses exploring exposure–lung function relationships.

cleaning <- cleaning %>%
  mutate(
    exposure_cat_clean = case_when(
      job_clean %in% c("miner","welder","factory worker","carpenter","mechanic","electrician") ~ "high_exposure",
      job_clean %in% c("office worker","teacher","doctor","nurse","construction worker", "security","cleaner","bartender","porter","farmer","driver") ~ "low_exposure",
      TRUE ~ "low_exposure"
    )
  )

6. Convert FEV1 and FVC to litres and calculate ratio

FEV1 and FVC are recorded in millilitres in the raw data, so we convert them to litres for analysis. The FEV1/FVC ratio is a key lung function indicator and values below normal suggest obstruction. During collection of this data some errors may have occurred, so it is important we remove any outliers at this stage that could bias our analyses later on.

cleaning <- cleaning %>%
  mutate(
    FEV1_L = FEV1_mL / 1000,
    FVC_L  = FVC_mL / 1000,
    FEV1_FVC = FEV1_L / FVC_L
  ) %>%
  mutate(
    FEV1_L = if_else(!is.na(FEV1_L) & FEV1_L <= 0, NA_real_, FEV1_L),
    FVC_L  = if_else(!is.na(FVC_L) & FVC_L <= 0, NA_real_, FVC_L),
    FEV1_FVC = if_else(!is.na(FEV1_FVC) & (FEV1_FVC <= 0 | FEV1_FVC > 1.2), NA_real_, FEV1_FVC)
  )

7. Convert FEV1 and FVC to litres and calculate ratio

This step applies the NHANES 2014 Hansen reference equations to estimate the Lower Limit of Normal (LLN) for FEV1/FVC, expressed in percent. For each participant:

Males use 74.5 − 0.12 × age
Females use 80.2 − 0.214 × age

We then compare their observed FEV1/FVC ratio (converted to a percentage) against this LLN. The new variable Below_LLN is coded as following:

1= airway obstruction
0= no airway obstruction

This binary flag is useful for logistic regression or for summarising the prevalence of obstruction across exposure or demographic groups. We will also use it when running different mixed-models changing the outcome variables.

cleaning <- cleaning %>%
  mutate(
    lln_pct_male = 74.5 - 0.12 * age,
    lln_pct_fem  = 80.2 - 0.214 * age,
    lln_pct = case_when(
      sex_clean == "Male"   ~ lln_pct_male,
      sex_clean == "Female" ~ lln_pct_fem,
      TRUE ~ ((74.5 - 0.12 * age) + (80.2 - 0.214 * age)) / 2  # fallback if sex not M/F
    ),
    # Convert ratio to percent for comparison, then flag below LLN
    Below_LLN = as.integer(!is.na(FEV1_FVC) & (FEV1_FVC * 100) < lln_pct)
  ) %>%
  select(-lln_pct_male, -lln_pct_fem)

✨ Clean Data: Ready for Analysis

Once cleaned, the dataset becomes mixed_clean.txt, with harmonised categories, derived lung function metrics, and an approximate lower limit of normal (LLN) for FEV1/FVC. This version can be used for teaching model fitting, random effects, and basic interpretation of respiratory epidemiology results.

library(knitr)
clean <- read.table("data/mixed_clean.txt", header = TRUE, sep = "\t")

kable(head(clean, 10))

id	site	age	sex	height_cm	weight_kg	BMI	smoking_status	pack_years	education	job	exposure_cat	FEV1_mL	FVC_mL	sex_clean	smoking_status_clean	education_clean	job_lc	job_clean	exposure_cat_clean	FEV1_L	FVC_L	FEV1_FVC	lln_pct	Below_LLN
1	UK	77	male	181.5	97.9	29.7	Occasional	11.0	tertiary	bartender	LOW	3469	4918	Male	Current	Tertiary	bartender	bartender	low_exposure	3.469	4.918	0.6486312	65.260	0
2	Vietnam	22	m	157.2	62.6	25.3	CURRENT	15.2	BSc	porter	LOW	4360	5107	Male	Current	Tertiary	porter	porter	low_exposure	4.360	5.107	0.8974913	71.860	0
3	Poland	45	M	146.3	48.3	22.6		0.0	primary	factory	high_exposure	3154	4898	Male		Primary	factory	factory worker	high_exposure	3.154	4.898	0.6525437	69.100	1
4	India	49	Female	149.1	62.2	28.0	current	6.7	Secondary	factory	high_exposure	1876	3283	Female	Current	Secondary	factory	factory worker	high_exposure	1.876	3.283	0.4948763	69.714	1
5	India	72	m	177.8	67.7	21.4	quit<1y	0.0		security	hi	3002	5338	Male	Former	None	security	security	low_exposure	3.002	5.338	0.4858307	65.860	1
6	Vietnam	26	other	160.4	78.8	30.6	Smokes socially	6.9	College	doctor	low	3010	3888	Other	Current	Tertiary	doctor	doctor	low_exposure	3.010	3.888	0.8179380	73.008	0
7	UK	47	male	165.2	63.8	23.4	NEVER	0.0	high school	nurse	low	4029	5274	Male	Never	Secondary	nurse	nurse	low_exposure	4.029	5.274	0.7071995	68.860	0
8	Poland	28	FEMALE	185.8	100.2	29.0	None	0.0	College	welder	high_exposure	3644	5218	Female		Tertiary	welder	welder	high_exposure	3.644	5.218	0.7069593	74.208	1
9	USA	71	F	184.4	88.1	25.9	quit<1y	13.1	Primary	cleaner	low	2750	3922	Female	Former	Primary	cleaner	cleaner	low_exposure	2.750	3.922	0.6687213	65.006	0
10	UK	60	F	172.2	79.0	26.6	NEVER	0.0	college	mechanic	high_exposure	2226	3583	Female	Never	Tertiary	mechanic	mechanic	high_exposure	2.226	3.583	0.5645303	67.360	1

Final variables in the Clean Dataset

Variable	Description
id	Unique participant identifier.
site	Study site (country); use as random intercept.
age	Age in years (numeric, cleaned).
sex	Standardised sex: Male, Female, or Other.
height_cm	Height in centimetres (numeric, cleaned).
weight_kg	Weight in kilograms (numeric, cleaned).
BMI	Body mass index derived from height and weight.
smoking_status	Recoded smoking category: Never, Former, Current, Occasional (or NA).
pack_years	Numeric smoking exposure (may be 0 for never smokers, NA if unknown).
education	Cleaned education level: None, Primary, Secondary, Tertiary.
job	Original reported occupation (as entered).
job_clean	Standardised occupation label (e.g., miner, office worker).
exposure_cat	Original exposure category from the raw file (left for reference).
exposure_cat_clean	Clean binary exposure category: low_exposure or high_exposure based on `job_clean`.
FEV1_mL	FEV1 in millilitres (numeric, cleaned).
FVC_mL	FVC in millilitres (numeric, cleaned).
FEV1_L	FEV1 converted to litres.
FVC_L	FVC converted to litres.
FEV1_FVC	Ratio FEV1/FVC (unitless).
LLN_FEV1FVC_approx	Age/sex–adjusted approximate LLN for FEV1/FVC (NHANES-inspired, for teaching).
Below_LLN	Indicator of obstruction: 1 = FEV1/FVC below LLN, 0 = otherwise.