Data curation

The UK Biobank (synthetic) dataset

Throughout this project, we will use a synthetic version of the UK Biobank data. The UK Biobank is a large cohort data set composed of over 500,000 volunteers recruited across United Kingdom between 2006-2010. For this project, we will use data from 252,877 participants who have high quality lung function data to select participants with airflow obstruction (AO) (defined as FEV1/FVC < LLN). We will use AO as out outcome to determine the mortality risk in people with AO vs people without AO.

Essential Variables

Event (death) variables	Description	Coding	Outcome variable	Description	Coding
cofdmain	Main cause of death coded by International Classification of Disease v10 (ICD10) This variable will be used later in the course to stratify the analysis by type of death	0 = No event (alive/censored) 1 = Event (dead)	AO.fev1fvc	Airflow obstruction derived as: forced expiratory volume at 1 s/ forced vital capacity below the lower limit of normal (from NHANES equations) (FEV1/FVC<LLN)	0 = No airflow obstruction 1 = Airflow obstruction
Date of death	Date of participant’s death obtained from the National Death Registry	As character but must be changed into numeric format

Additional variables used in the multivariable (adjusted model)

Covariates name	Description	Coding
eid	Unique participant Identification Number for anonymised data	Interger
Age	Age of participant at recruitment (when they entered the study)	Interger
Sex	Sex of participant	Male, Female
bmi	Body Max Index of participant	Numeric
UKB_centre	UK Biobank Centre at which the participant was recruited	Character
Ethnicity	Genetically-derived ethnicity of the participant	White vs Other
Smoking_status	Smoking Status of the participant at recruitment	Never, current, previous

Data Curation

Note: the Surv() function in the in the {survival} package, which will be used in the Survival analysis chapter, accepts different formatting of the event date e.g. TRUE/FALSE, where TRUE is event and FALSE is censored; 1/0 where 1 is event and 0 is censored; or 2/1 where 2 is event and 1 is censored.

Please make sure that the event is properly formatted. In this example we will be using 0/1.

# Load Data set 

death <- read.table("data/synthetic_data_ReCoDe.txt", header=TRUE, sep="\t")

#Months need to be in number and not character
death$monthofbirth <- as.character(death$monthofbirth)
death$monthofbirth[death$monthofbirth == "January"] <- "01"
death$monthofbirth[death$monthofbirth == "February"] <- "02"
death$monthofbirth[death$monthofbirth == "March"] <- "03"
death$monthofbirth[death$monthofbirth == "April"] <- "04"
death$monthofbirth[death$monthofbirth == "May"] <- "05"
death$monthofbirth[death$monthofbirth == "June"] <- "06"
death$monthofbirth[death$monthofbirth == "July"] <- "07"
death$monthofbirth[death$monthofbirth == "August"] <- "08"
death$monthofbirth[death$monthofbirth == "September"] <- "09"
death$monthofbirth[death$monthofbirth == "October"] <- "10"
death$monthofbirth[death$monthofbirth == "November"] <- "11"
death$monthofbirth[death$monthofbirth == "December"] <- "12"

#Format of year XXXX- MonthXX- day XX (As a date)
death$dateofbirth <- paste0(death$yearofbirth,"-", death$monthofbirth, "-01")
death$dateofbirth <- as.Date(death$dateofbirth)
death$date_of_death <- as.Date(death$date_of_death)
death$now_cens <- as.Date("2023-01-01") #Change to "end of study" censoring date

Now that the dates are formatted, we need to calculate the difference between start and end dates in some units, in this example the time will be calculated in years.

As participants could enter the study at any point between 2006-2010, there is a delayed entry into the study. This type of data is know as left-truncated, and as we will define an “end date” the data will also be right-censored. We therefore use age as the time-scale to account for these.

death$timetoevent <- death$date_of_death - death$dateofbirth 
death$timetocens <- death$now_cens - death$dateofbirth 

#Select final time if not dead then censoring date
death$time <- ifelse(is.na(death$date_of_death), death$timetocens, death$timetoevent)

death$time_years <- death$time/365.25 #To have time in years

##Censoring of participants- do we know when they died? if so = 1 (Event occurs)
##Otherwise = 0 as we do not know when this could happen. 

death$allcause_death <- as.character(death$cofdmain)
death$allcause_death[is.na(death$allcause_death)] <- 0 #No event occured (ALIVE)
death$allcause_death[death$allcause_death > 0] <- 1 #Event occurs (Death)

Overview of the data

library(knitr)

head_data <- head(death[, c("Sex","Smoking_status", "allcause_death", "time_years")], 15)

# Create a table
kable(head_data, caption = "First 15 Observations of Survival Analysis Data")

First 15 Observations of Survival Analysis Data
Sex	Smoking_status	allcause_death	time_years
Female	Never	1	67.56742
Female	Never	0	58.25051
Male	Never	0	72.41889
Female	Never	0	64.75291
Female	Never	0	82.58453
Male	Previous	0	65.41821
Female	Previous	0	64.25188
Female	Never	0	79.67146
Male	Previous	0	81.50308
Female	Previous	0	67.08556
Female	Never	0	64.08487
Male	Never	0	84.00000
Male	Current	0	62.16564
Female	Never	0	74.58453
Female	Never	0	71.33470

How many participants have experienced the event? (e.g. dead)

library(dplyr)
library(tidyr)
library(knitr)

# Create a summary table
summary_table <- death %>%
  group_by(Sex, allcause_death) %>%
  summarise(count = n(), .groups = 'drop') %>%
  pivot_wider(names_from = Sex, values_from = count, values_fill = list(count = 0)) %>%
  mutate(Total = Male + Female)

# Display the summary table
kable(summary_table, caption = "Number of participants that have died by sex")

Number of participants that have died by sex
allcause_death	Female	Male	Total
0	142563	95574	238137
1	6986	7754	14740

How many participants have airflow obstruction (outcome)?

library(dplyr)
library(tidyr)
library(knitr)

summary_table <- death %>%
  group_by(Sex, AO.fev1fvc) %>%
  summarise(count = n(), .groups = 'drop') %>%
  pivot_wider(names_from = Sex, values_from = count, values_fill = list(count = 0)) %>%
  mutate(Total = Male + Female)

# Display the summary table
kable(summary_table, caption = "Number of participants with airflow obstruction by sex")

Number of participants with airflow obstruction by sex
AO.fev1fvc	Female	Male	Total
0	139183	91844	231027
1	10366	11484	21850

#save(death, file = "data/result.RData")