5  Data Wrangling

In this chapter, we delve into the manipulation of data in the form of a data frame or tibble. In so doing, we will introduce the tidyverse package and the various verbs (function) it provides.

The tidyverse package is not just a single package but a composite of a group of packages. These include among others the dplyr package. Most of the function we will be employing in this chapter comes from dplyr.

We begin by reading in the blood_donors.xls

Code
df_blood <- 
    readxl::read_xls("C:/Dataset/blood_donors_1.xls")

df_blood %>% 
    head(10)
idhbhctsexbldgrppdonor
110.531.8MaleO3
211.937.2MaleAB0
31  26  MaleA1
48.926.8MaleA3
57.824.2MaleA2
610  30.9MaleB1
710.433.9MaleB0
811.335  MaleO1
916.4  MaleAB1
1014.443.6MaleAB1

The output shows we have a 25-row and 6-column tibble.

5.1 Renaming variables

Below we rename the variables hb to hemog and id to studyid using he rename function, and then show the first 5 records with the head function.

Code
df_blood %>% 
    rename(hemog = hb, studyid = id) %>% 
    head(5)
studyidhemoghctsexbldgrppdonor
110.531.8MaleO3
211.937.2MaleAB0
31  26  MaleA1
48.926.8MaleA3
57.824.2MaleA2

5.2 Sorting data

Below we use the arrange function to sort the bldgrp in ascending order and hb by descending order.

Code
df_blood %>% 
    arrange(bldgrp, desc(hb)) %>% 
    head(10)
idhbhctsexbldgrppdonor
179.830.5FemaleA4
219.128  A3
48.926.8MaleA3
57.824.2MaleA2
31  26  MaleA1
916.4  MaleAB1
1014.443.6MaleAB1
1612.799  FemaleAB0
2412.338.2AB2
1412.236.8FemaleAB1

5.3 Subsetting data

In this subsection, we demonstrate the use of the filter and select function to select specific records and variables in a tibble. Below we filter to select all records with hb > 12g/dl and keep only the id, hb and sex columns.

Code
df_blood %>% 
    filter(hb > 12) %>% 
    select(id, hb, sex)
idhbsex
916.4Male
1014.4Male
1412.2Female
1416.4Female
1612.7Female
2412.3

5.4 Generating new variables

To generate new variables we use the mutate function. Based on our knowledge that the hematocrit is approximately three times the haemoglobin we generate a new variable, hb_from_hct.

Code
df_blood %>% 
    mutate(hb_from_hct = hct/3) %>% 
    head(10)
idhbhctsexbldgrppdonorhb_from_hct
110.531.8MaleO310.6 
211.937.2MaleAB012.4 
31  26  MaleA18.67
48.926.8MaleA38.93
57.824.2MaleA28.07
610  30.9MaleB110.3 
710.433.9MaleB011.3 
811.335  MaleO111.7 
916.4  MaleAB1   
1014.443.6MaleAB114.5 

5.5 Aggregating data

Data can be aggregated in R using the summarize function. Below we determine the mean and standard deviation of the haemoglobin for the patient in the data.

Code
df_blood %>% 
    summarize(mean_hb = mean(hb), sd_hb = sd(hb))
mean_hbsd_hb
112.89

Grouping the data by the “bldgrp” before the aggregation yields the aggregated means and standard deviations for the various blood groups.

Code
df_blood %>% 
    group_by(bldgrp) %>% 
    summarize(mean_hb = mean(hb), sd_hb = sd(hb))
bldgrpmean_hbsd_hb
A7.323.61 
AB13.1 1.69 
B10.2 0.283
O11   0.427
P16.4     

5.6 Reshaping data

In longitudinal studies, data is captured from the same individual repeatedly. Such data is recorded either in long or wide formats. A typical example of a data frame in the long form is bpB below.

Code
bp_long <- read_csv(
    file = "C:/Dataset/bp_long.txt",
    col_names = TRUE, 
    col_types = c("c", "c", "i"))

bp_long
idmeasuresbp
B01sbp1141
B01sbp2137
B02sbp1155
B02sbp2153
B03sbp1153

In this format, each visit or round of data taking is captured as a new row, but with the appropriate study ID and period of record, captured as the variable measure above. Measurement of systolic blood pressure on day 1 is indicated by sbp1 in the measure variable. Day 2 measurements are indicated as sbp2.

The wide format of the same data can be obtained as below.

Code
bp_wide <- 
    bp_long %>% 
    pivot_wider(
        id_cols = id, 
        names_from = measure, 
        values_from = sbp)

bp_wide
idsbp1sbp2
B01141137
B02155153
B03153

Here, each study participant’s record for the whole study is on one row of the data and the different measurements of systolic blood pressure are captured as different variables. Next, we convert the wide back to the long format.

Code
bp_wide %>% 
    pivot_longer(
        cols = c(sbp1, sbp2),
        names_to = "time",
        values_to = "syst_bp")
idtimesyst_bp
B01sbp1141
B01sbp2137
B02sbp1155
B02sbp2153
B03sbp1153
B03sbp2

5.7 Combining data

In a study to determine the change in weight of athletes running a marathon, data about the athletes were obtained by the investigators. Since the marathon starts in town A and ends in town B, the investigators decided to weigh the athletes just before starting the race. Here they took records of the ID of the athlete’s sid, sex, age and weight at the start (wgtst). The records of five of these athletes are in the data marathonA. At the end point of the marathon, another member of the investigation team recorded their IDs (eid), weight upon completion (wgtend) and the time it took the athletes to complete the marathon (dura).

Code
dataA <- 
    read_delim(
        file = "C:/Dataset/marathonA.txt",
        col_names = TRUE,
        delim = "\t",
        col_types = c("c","c","i","d"))

dataB <- 
    read_delim(
        file = "C:/Dataset/marathonB.txt",
        col_names = TRUE,
        delim = "\t",
        col_types = c("c","c","i","d"))

dataA
sidsexagewgtst
C001M2357.1
C002F2762.3
C003M1954.5
C004M2159.4
C005F3253.4
Code
dataB
eidwgtenddura
C00353.9189
C00553  197
C00262.2201
C00156.8209

We can determine the weight change only by matching the before and after weight of each individual. This is where merging is very useful. Below, we merge the two data into one. This is done below.

Code
dataA %>% 
    full_join(dataB, by = join_by(sid == eid))
sidsexagewgtstwgtenddura
C001M2357.156.8209
C002F2762.362.2201
C003M1954.553.9189
C004M2159.4  
C005F3253.453  197

5.8 Reading in data

Code
dataF <-
    readxl::read_xlsx("C:/Dataset/SBPDATA.xlsx") %>% 
    janitor::clean_names() %>% 
    rename(
        ageyrs = a3_how_old_are_you_years,
        dxs_class = disease_class,
        gender = a1_gender
        ) %>% 
    mutate(
        dxs_class = factor(dxs_class),
        gender = factor(
            gender, 
            levels = c(0, 1), 
            labels = c("Male", "Female")))

dataF %>% select(1:5) %>% head()
siddxs_classsbp_0sbp_2sbp_4
1HPT139124130
2DM+HPT155
3HPT109123109
4HPT130
5HPT124120146
6DM+HPT140114163
Code
dat <- 
    tribble(
        ~"name", ~"day", ~"month", ~"year", ~"bp",
        "Ama", 12, 05, 2020, "120/80",
        "Kwame", 14, 02, 2019, "132/66",
        "Akosua", 21, 12, 2010, "110/76",
        "Yaw", 13, 03, 1982, "144/98",
        "Yaa", 19, 08, 2000, "117/77")

dat
namedaymonthyearbp
Ama1252.02e+03120/80
Kwame1422.02e+03132/66
Akosua21122.01e+03110/76
Yaw1331.98e+03144/98
Yaa1982e+03       117/77

5.9 arrange

Code
dat %>% arrange(name, desc(day))
namedaymonthyearbp
Akosua21122.01e+03110/76
Ama1252.02e+03120/80
Kwame1422.02e+03132/66
Yaa1982e+03       117/77
Yaw1331.98e+03144/98

5.10 unite

Code
dat %>% 
    unite(col = "dob", c(day, month, year), sep="/") 
namedobbp
Ama12/5/2020120/80
Kwame14/2/2019132/66
Akosua21/12/2010110/76
Yaw13/3/1982144/98
Yaa19/8/2000117/77

5.11 seperate

Code
dat %>% 
    separate(col = bp, into = c("sbp", "dbp"), sep = "/") 
namedaymonthyearsbpdbp
Ama1252.02e+0312080
Kwame1422.02e+0313266
Akosua21122.01e+0311076
Yaw1331.98e+0314498
Yaa1982e+03       11777
Code
dat %>% 
    separate(col = bp, into = c("sbp", "dbp"), sep = "/") %>% 
    unite(col = "dob", c(day, month, year), sep="/") %>% 
    mutate(dob_new = lubridate::dmy(dob)) 
namedobsbpdbpdob_new
Ama12/5/2020120802020-05-12
Kwame14/2/2019132662019-02-14
Akosua21/12/2010110762010-12-21
Yaw13/3/1982144981982-03-13
Yaa19/8/2000117772000-08-19

5.12 relocate

Code
dataF %>% 
    relocate(ageyrs, gender, .before = sbp_0) %>% 
    select(1:8) %>% 
    slice_head(n=10) 
siddxs_classageyrsgendersbp_0sbp_2sbp_4sbp_6
1HPT75Male139124130130
2DM+HPT60Male155
3HPT62Male109123109126
4HPT70Male130
5HPT72Male124120146144
6DM+HPT56Male140114163117
7DM+HPT51Male137135132147
8DM73Male160130
9HPT61Female153218
10HPT59Male135130118150
Code
dataF %>% 
    select(1:4) %>% 
    relocate(sid, .after = last_col()) %>% 
    slice_head(n=10) 
dxs_classsbp_0sbp_2sid
HPT1391241
DM+HPT1552
HPT1091233
HPT1304
HPT1241205
DM+HPT1401146
DM+HPT1371357
DM1601308
HPT1532189
HPT13513010
Code
dataF %>% 
    select(1:7) %>% 
    relocate(where(is.numeric)) %>% 
    slice_head(n=10) 
sidsbp_0sbp_2sbp_4sbp_6sbp_8dxs_class
1139124130130104HPT
2155DM+HPT
3109123109126108HPT
4130HPT
5124120146144157HPT
6140114163117124DM+HPT
7137135132147130DM+HPT
8160130DM
9153218HPT
10135130118150HPT
Code
dataF %>% 
    select(1:6) %>% 
    relocate(contains("sbp")) %>% 
    slice_head(n=10) 
sbp_0sbp_2sbp_4sbp_6siddxs_class
1391241301301HPT
1552DM+HPT
1091231091263HPT
1304HPT
1241201461445HPT
1401141631176DM+HPT
1371351321477DM+HPT
1601308DM
1532189HPT
13513011815010HPT

5.13 reframe & across

Code
dataF %>% 
    drop_na(dxs_class) %>% 
    reframe(
        across(
            sbp_2:sbp_8, 
            list(
                "Average" = ~mean(.x, na.rm=T),
                "Std" = ~sd(.x, na.rm=T)),
            .names = "{.fn}_{.col}"), 
        .by = dxs_class)
dxs_classAverage_sbp_2Std_sbp_2Average_sbp_4Std_sbp_4Average_sbp_6Std_sbp_6Average_sbp_8Std_sbp_8
HPT14022.913722.413721.513621.6
DM+HPT14424  14424.514323.914324.5
DM12819.512820.312819.512619.2
Code
dataF %>% 
    na.omit() %>% 
    select(dxs_class, sbp_0:sbp_6) %>% 
    group_by(dxs_class) %>% 
    reframe(across(where(is.numeric), ~quantile(.x))) 
dxs_classsbp_0sbp_2sbp_4sbp_6
DM81708494
DM114116113114
DM124125125126
DM134139138140
DM189194199187
DM+HPT98818870
DM+HPT129127125126
DM+HPT142141142139
DM+HPT158157158155
DM+HPT216231231234
HPT90717888
HPT126124120120
HPT138135132131
HPT151150147146
HPT219221220209

5.14 Distinct observations

Code
dataF %>% 
    reframe(across(where(is.numeric), n_distinct)) 
sidsbp_0sbp_2sbp_4sbp_6sbp_8sbp_10sbp_12sbp_14sbp_16sbp_18ageyrs
329613814114513813513413413313012777

5.15 list of functions

Code
dataF %>% 
    filter(!is.na(dxs_class)&!is.na(gender)) %>% 
    group_by(dxs_class, gender) %>%
    reframe(
        across(
            starts_with("sbp"), 
            list(
                AVG = mean, 
                SD = sd, 
                N_missing = ~sum(is.na(.x), na.rm=TRUE)))) 
dxs_classgendersbp_0_AVGsbp_0_SDsbp_0_N_missingsbp_2_AVGsbp_2_SDsbp_2_N_missingsbp_4_AVGsbp_4_SDsbp_4_N_missingsbp_6_AVGsbp_6_SDsbp_6_N_missingsbp_8_AVGsbp_8_SDsbp_8_N_missingsbp_10_AVGsbp_10_SDsbp_10_N_missingsbp_12_AVGsbp_12_SDsbp_12_N_missingsbp_14_AVGsbp_14_SDsbp_14_N_missingsbp_16_AVGsbp_16_SDsbp_16_N_missingsbp_18_AVGsbp_18_SDsbp_18_N_missing
DMMale  15770757479798594110
DMFemale12819.80172329302932365055
DM+HPTMale14523  084113119145153153186204247
DM+HPTFemale14722  0263639454355576884
HPTMale  3237309354401414473522639741
HPTFemale14420.9074116131147145165173216241

5.16 Summarizing by anonymous functions

Code
dataF %>% 
    filter(!is.na(dxs_class)) %>% 
    group_by(dxs_class) %>%
    reframe(
        across(
            .cols = c(sbp_0), 
            .fns = list(
                "Mean"    = ~mean(.x, na.rm=T), 
                "UpperCI" = ~mean(
                    .x, na.rm=T) + 1.96*sd(.x, na.rm=T)/sqrt(n()) ,
                "LowerCI" = ~mean(
                    .x, na.rm=T) - 1.96*sd(.x, na.rm=T)/sqrt(n())))) 
dxs_classsbp_0_Meansbp_0_UpperCIsbp_0_LowerCI
DM126128124
DM+HPT145147144
HPT142143141

5.17 expand

Code
dataF %>% 
    filter(!is.na(dxs_class) & !is.na(gender)) %>% 
    expand(dxs_class, gender) 
dxs_classgender
DMMale
DMFemale
DM+HPTMale
DM+HPTFemale
HPTMale
HPTFemale

5.18 crossing

Code
dataF %>% 
    filter(!is.na(dxs_class) & !is.na(gender)) %>% 
    select(dxs_class, gender) %>% 
    crossing() 
dxs_classgender
DMMale
DMFemale
DM+HPTMale
DM+HPTFemale
HPTMale
HPTFemale

5.19 Adding a running id

Code
dataF %>% 
    filter(!is.na(dxs_class) & !is.na(gender)) %>% 
    select(dxs_class, gender) %>%
    mutate(running_id = row_number()) %>% 
    slice_head(n=10) 
dxs_classgenderrunning_id
HPTMale1
DM+HPTMale2
HPTMale3
HPTMale4
HPTMale5
DM+HPTMale6
DM+HPTMale7
DMMale8
HPTFemale9
HPTMale10

5.20 pivot_longer & pivot_wider

Code
dataF_long <-
    dataF %>% 
    select(gender, dxs_class, sbp_0:sbp_18) %>% 
    pivot_longer(
        cols = starts_with("sbp"),
        names_to = "measure",
        values_to = "sbp",
        values_drop_na = TRUE)

dataF_long %>% 
    slice_head(n=10) 
genderdxs_classmeasuresbp
MaleHPTsbp_0139
MaleHPTsbp_2124
MaleHPTsbp_4130
MaleHPTsbp_6130
MaleHPTsbp_8104
MaleHPTsbp_10129
MaleHPTsbp_1280
MaleHPTsbp_14129
MaleHPTsbp_16126
MaleHPTsbp_18135
Code
dataF %>% 
    select(dxs_class, gender, sbp_0, sbp_4) %>%
    na.omit() %>% 
    group_by(dxs_class) %>% 
    pivot_wider(
        names_from = gender, 
        values_from = c(sbp_0, sbp_4), 
        values_fn = ~mean(.x, na.rm = TRUE)) %>% 
    ungroup()  
dxs_classsbp_0_Malesbp_0_Femalesbp_4_Malesbp_4_Female
HPT141144136141
DM+HPT145147144146
DM126130125135

5.21 tidyquant tabulation

Code
dataF %>% 
    select(dxs_class, gender) %>% 
    na.omit() %>% 
    tidyquant::pivot_table(
        .rows = gender, .columns = dxs_class, .values = ~n()
    ) 
genderDMDM+HPTHPT
Male3087771434
Female114226430
Code
dataF %>% 
    select(dxs_class, gender, sbp_0, sbp_2) %>% 
    na.omit() %>% 
    tidyquant::pivot_table(
        .rows = gender, 
        .columns = dxs_class, 
        .values = ~quantile(sbp_0)) %>% 
    unnest(cols = c("DM","HPT","DM+HPT")) 
genderDMDM+HPTHPT
Male909570
Male113128126
Male124142140
Male134158154
Male182228224
Female819898
Female117131129
Female127145141
Female138161159
Female208220210

5.22 rowwise manipulations

Code
dataF %>% 
    rowwise() %>% 
    mutate(
        sbp_mean = mean(
            c(sbp_0,sbp_2,sbp_4,sbp_6,sbp_8, sbp_10, sbp_12,
              sbp_14,sbp_16, sbp_18), na.rm=T),
        sbp_sd = sd(
            c(sbp_0,sbp_2,sbp_4,sbp_6,sbp_8, sbp_10, sbp_12,
              sbp_14,sbp_16, sbp_18), na.rm=T),
        n = n()) %>% 
    ungroup() %>% 
    select(sid, dxs_class, sbp_mean, sbp_sd, 
        sbp_0:sbp_4) %>% 
    slice_head(n=10) 
siddxs_classsbp_meansbp_sdsbp_0sbp_2sbp_4
1HPT12317.6 139124130
2DM+HPT155   155
3HPT1169.25109123109
4HPT130   130
5HPT13113.4 124120146
6DM+HPT12516.8 140114163
7DM+HPT1357.75137135132
8DM14521.2 160130
9HPT19637.5 153218
10HPT13311.5 135130118

5.23 str_glue

Code
x <- c("Ama", "is", "a", "Girl")
cat(x)
Ama is a Girl
Code
name <- "Fred"
str_glue('My name is {name}.')
My name is Fred.
Code
stringr_fcn <- "`stringr::str_glue()`"
glue_fcn    <- "`glue::glue()`"

str_glue('{stringr_fcn} is essentially an alias for {glue_fcn}.')
`stringr::str_glue()` is essentially an alias for `glue::glue()`.
Code
name <- "Fred"
age <- 50
anniversary <- as.Date("1991-10-12")
str_glue('My name is {name},',
  ' my age next year is {age + 1},',
  ' my anniversary is {format(anniversary, "%A, %B %d, %Y")}.')
My name is Fred, my age next year is 51, my anniversary is Saturday, October 12, 1991.
Code
str_glue('My name is {name},',
  ' my age next year is {age + 1},',
  ' my anniversary is {format(anniversary, "%A, %B %d, %Y")}.',
  name = "Joe",
  age = 40,
  anniversary = as.Date("2001-10-12"))
My name is Joe, my age next year is 41, my anniversary is Friday, October 12, 2001.
Code
mtcars %>% 
    head() 
mpgcyldisphpdratwtqsecvsamgearcarb
21  61601103.9 2.6216.50144
21  61601103.9 2.8817  0144
22.84108933.852.3218.61141
21.462581103.083.2119.41031
18.783601753.153.4417  0032
18.162251052.763.4620.21031
Code
head(mtcars) %>% 
    glue::glue_data("{rownames(.)} has {hp} hp")
Mazda RX4 has 110 hp
Mazda RX4 Wag has 110 hp
Datsun 710 has 93 hp
Hornet 4 Drive has 110 hp
Hornet Sportabout has 175 hp
Valiant has 105 hp
Code
head(iris) %>%
  mutate(
      description = str_glue(
          "This {Species} has a petal length of {Petal.Length}"
          )
      ) 
Sepal.LengthSepal.WidthPetal.LengthPetal.WidthSpeciesdescription
5.13.51.40.2setosaThis setosa has a petal length of 1.4
4.93  1.40.2setosaThis setosa has a petal length of 1.4
4.73.21.30.2setosaThis setosa has a petal length of 1.3
4.63.11.50.2setosaThis setosa has a petal length of 1.5
5  3.61.40.2setosaThis setosa has a petal length of 1.4
5.43.91.70.4setosaThis setosa has a petal length of 1.7
Code
str_glue("
    A formatted string
    Can have multiple lines
      with additional indention preserved
    ")
A formatted string
Can have multiple lines
  with additional indention preserved
Code
str_glue("

  leading or trailing newlines can be added explicitly

  ")

leading or trailing newlines can be added explicitly
Code
str_glue("
    A formatted string \\
    can also be on a \\
    single line
    ")
A formatted string can also be on a single line
Code
name <- "Fred"
str_glue("My name is {name}, not {{name}}.")
My name is Fred, not {name}.
Code
one <- "1"
str_glue(
    "The value of $e^{2\\pi i}$ is $<<one>>$.", 
    .open = "<<", 
    .close = ">>")
The value of $e^{2\pi i}$ is $1$.
Code
dataF %>% 
    filter(!is.na(sbp_0)) %>% 
    ggplot(aes(x=sbp_0)) +
    geom_histogram(col = "grey", fill = "wheat") +
    labs(title = str_glue(
        "Histogram with Mean = {mean_sbp0}mmHg and \\
         Standard Deviation = {sd_sbp0}",
        mean_sbp0 = mean(dataF$sbp_0, na.rm=T) %>% 
            round(1),
        sd_sbp0 = sd(dataF$sbp_0,   na.rm=T) %>% 
            round(1)),
         x = "Systolic Blood Pressure (mmHg)",
         y = "Frequency") +
    theme_light(base_size = 12, base_family = "serif")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.