Wrangling simulated outbreak data

The {simulist} R package can generate line list data (sim_linelist()), contact tracing data (sim_contacts()), or both (sim_outbreak()). By default the line list produced by sim_linelist() and sim_outbreak() contains 13 columns. Some post-simulation data wrangling may be needed to use the simulated epidemiological case data for certain applications. This vignette demonstrates some common data wrangling tasks that may be performed on simulated line list or contact tracing data.

This vignette provides data wrangling examples using both functions available in the R language (commonly called “base R”) and tidyverse R packages, which are commonly applied to data science tasks in R. The tidyverse examples are shown first; see the “Base R” examples for the equivalent functionality using base R. There are many other tools for wrangling data in R which are not covered by this vignette (e.g. {data.table}).

Simulate an outbreak

To simulate an outbreak we will use the sim_outbreak() function from the {simulist} R package.

If you are unfamiliar with the {simulist} package or the sim_outbreak() function, the Get Started vignette is a great place to start.

First we load in some data that is required for the outbreak simulation. Data on epidemiological parameters and distributions are read from the {epiparameter} R package.

# create contact distribution (not available from {epiparameter} database)
contact_distribution <- epiparameter(
  disease = "COVID-19",
  epi_name = "contact distribution",
  prob_distribution = create_prob_distribution(
    prob_distribution = "pois",
    prob_distribution_params = c(mean = 2)
  )
)
#> Citation cannot be created as author, year, journal or title is missing

# create infectious period (not available from {epiparameter} database)
infectious_period <- epiparameter(
  disease = "COVID-19",
  epi_name = "infectious period",
  prob_distribution = create_prob_distribution(
    prob_distribution = "gamma",
    prob_distribution_params = c(shape = 1, scale = 1)
  )
)
#> Citation cannot be created as author, year, journal or title is missing

# get onset to hospital admission from {epiparameter} database
onset_to_hosp <- epiparameter_db(
  disease = "COVID-19",
  epi_name = "onset to hospitalisation",
  single_epiparameter = TRUE
)
#> Using Linton N, Kobayashi T, Yang Y, Hayashi K, Akhmetzhanov A, Jung S, Yuan
#> B, Kinoshita R, Nishiura H (2020). "Incubation Period and Other
#> Epidemiological Characteristics of 2019 Novel Coronavirus Infections
#> with Right Truncation: A Statistical Analysis of Publicly Available
#> Case Data." _Journal of Clinical Medicine_. doi:10.3390/jcm9020538
#> <https://doi.org/10.3390/jcm9020538>.. 
#> To retrieve the citation use the 'get_citation' function

# get onset to death from {epiparameter} database
onset_to_death <- epiparameter_db(
  disease = "COVID-19",
  epi_name = "onset to death",
  single_epiparameter = TRUE
)
#> Using Linton N, Kobayashi T, Yang Y, Hayashi K, Akhmetzhanov A, Jung S, Yuan
#> B, Kinoshita R, Nishiura H (2020). "Incubation Period and Other
#> Epidemiological Characteristics of 2019 Novel Coronavirus Infections
#> with Right Truncation: A Statistical Analysis of Publicly Available
#> Case Data." _Journal of Clinical Medicine_. doi:10.3390/jcm9020538
#> <https://doi.org/10.3390/jcm9020538>.. 
#> To retrieve the citation use the 'get_citation' function
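To get a feel for what the two manually specified distributions imply, they can be evaluated with base R density functions. This is only an illustrative sketch using the parameter values from the code above; it does not use the {epiparameter} objects themselves.

```r
# Poisson contact distribution with mean 2: probability of 0 to 4 contacts
contact_probs <- dpois(0:4, lambda = 2)
names(contact_probs) <- 0:4
contact_probs

# a gamma distribution with shape = 1 and scale = 1 has mean shape * scale
infectious_mean <- 1 * 1
infectious_mean

# with shape = 1 the gamma distribution reduces to an exponential distribution
all.equal(dgamma(0.5, shape = 1, scale = 1), dexp(0.5, rate = 1))
```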

The seed is set to ensure the output of the vignette is consistent. When using {simulist}, setting the seed is not required unless you need to simulate the same line list multiple times.

set.seed(123)

outbreak <- sim_outbreak(
  contact_distribution = contact_distribution,
  infectious_period = infectious_period,
  prob_infection = 0.5,
  onset_to_hosp = onset_to_hosp,
  onset_to_death = onset_to_death
)
linelist <- outbreak$linelist
contacts <- outbreak$contacts

Censoring dates

The event date columns in simulated line lists are stored as double-precision numbers, meaning they hold the exact event times rather than whole days. It is unusual for <Date> objects to hold non-integer values, as explained in ?Dates, and the print() method for <Date> objects does not show that a date may be part way through a day.

Here we show this by printing the dates of symptom onset for the simulated data, and then calling unclass() on them to show how they are stored internally.

linelist$date_onset
#>  [1] "2023-01-01" "2023-01-01" "2023-01-01" "2023-01-01" "2023-01-02"
#>  [6] "2023-01-01" "2023-01-01" "2023-01-01" "2023-01-01" "2023-01-01"
#> [11] "2023-01-02" "2023-01-02" "2023-01-02"
unclass(linelist$date_onset)
#>  [1] 19358.00 19358.24 19358.04 19358.38 19359.16 19358.85 19358.14 19358.70
#>  [9] 19358.41 19358.97 19359.40 19359.47 19359.51
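The same behaviour can be reproduced with a single hand-made date. This small sketch uses an arbitrary value (19358.85 days since 1970-01-01) rather than the simulated data.

```r
# a <Date> holding a fractional number of days since 1970-01-01
x <- as.Date(19358.85, origin = "1970-01-01")

format(x)   # the printed date drops the fractional part
unclass(x)  # the underlying number still contains it
```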

The censor_linelist() function can be used after sim_linelist() to censor the event dates to a given precision. Here we show censoring the dates to daily and weekly intervals. The daily censored dates will print the same as before, but any fractional part of the date is set to zero. The weekly censored dates are printed differently.

daily_cens_linelist <- censor_linelist(linelist, interval = "daily")
head(daily_cens_linelist)
#>   id             case_name case_type sex age date_onset date_reporting
#> 1  1         Fabian Mrazik confirmed   m  90 2023-01-01     2023-01-01
#> 2  3       Ashley Martinez confirmed   f  71 2023-01-01     2023-01-01
#> 3  4                Tia Vu  probable   f  48 2023-01-01     2023-01-01
#> 4  5 Abdul Majeed el-Saleh confirmed   m  77 2023-01-01     2023-01-01
#> 5  6        Courtney Flood suspected   f  83 2023-01-02     2023-01-02
#> 6  7          Joseph Jiron suspected   m  56 2023-01-01     2023-01-01
#>   date_admission   outcome date_outcome date_first_contact date_last_contact
#> 1           <NA> recovered         <NA>               <NA>              <NA>
#> 2     2023-01-08      died   2023-01-10         2022-12-26        2023-01-06
#> 3           <NA> recovered         <NA>         2022-12-30        2023-01-05
#> 4           <NA> recovered         <NA>         2022-12-31        2023-01-08
#> 5           <NA> recovered         <NA>         2022-12-26        2023-01-04
#> 6           <NA> recovered         <NA>         2022-12-28        2023-01-03
#>   ct_value
#> 1     21.9
#> 2     22.7
#> 3       NA
#> 4     27.4
#> 5       NA
#> 6       NA

weekly_cens_linelist <- censor_linelist(linelist, interval = "weekly")
head(weekly_cens_linelist)
#>   id             case_name case_type sex age date_onset date_reporting
#> 1  1         Fabian Mrazik confirmed   m  90   2022-W52       2022-W52
#> 2  3       Ashley Martinez confirmed   f  71   2022-W52       2022-W52
#> 3  4                Tia Vu  probable   f  48   2022-W52       2022-W52
#> 4  5 Abdul Majeed el-Saleh confirmed   m  77   2022-W52       2022-W52
#> 5  6        Courtney Flood suspected   f  83   2023-W01       2023-W01
#> 6  7          Joseph Jiron suspected   m  56   2022-W52       2022-W52
#>   date_admission   outcome date_outcome date_first_contact date_last_contact
#> 1           <NA> recovered         <NA>               <NA>              <NA>
#> 2       2023-W01      died     2023-W02           2022-W52          2023-W01
#> 3           <NA> recovered         <NA>           2022-W52          2023-W01
#> 4           <NA> recovered         <NA>           2022-W52          2023-W01
#> 5           <NA> recovered         <NA>           2022-W52          2023-W01
#> 6           <NA> recovered         <NA>           2022-W52          2023-W01
#>   ct_value
#> 1     21.9
#> 2     22.7
#> 3       NA
#> 4     27.4
#> 5       NA
#> 6       NA

See ?censor_linelist for more information on how to use this function.

Using censor_linelist() avoids common mistakes when working with <Date> objects. For example, rounding a date that is more than halfway through a day will mistakenly result in the next day, as the comparison below shows.

linelist$date_onset
#>  [1] "2023-01-01" "2023-01-01" "2023-01-01" "2023-01-01" "2023-01-02"
#>  [6] "2023-01-01" "2023-01-01" "2023-01-01" "2023-01-01" "2023-01-01"
#> [11] "2023-01-02" "2023-01-02" "2023-01-02"
round(linelist$date_onset)
#>  [1] "2023-01-01" "2023-01-01" "2023-01-01" "2023-01-01" "2023-01-02"
#>  [6] "2023-01-02" "2023-01-01" "2023-01-02" "2023-01-01" "2023-01-02"
#> [11] "2023-01-02" "2023-01-02" "2023-01-03"
daily_cens_linelist$date_onset
#>  [1] "2023-01-01" "2023-01-01" "2023-01-01" "2023-01-01" "2023-01-02"
#>  [6] "2023-01-01" "2023-01-01" "2023-01-01" "2023-01-01" "2023-01-01"
#> [11] "2023-01-02" "2023-01-02" "2023-01-02"
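The difference comes down to rounding versus rounding down. The sketch below reproduces it in base R on two made-up onset times; it illustrates the idea only and is not necessarily how censor_linelist() is implemented.

```r
# two hypothetical onset times: early and late on the same day
onset <- as.Date(c(19358.24, 19358.85), origin = "1970-01-01")

# rounding pushes the late onset into the next day
rounded <- round(onset)

# flooring the underlying number keeps both onsets on the day they occurred
floored <- as.Date(floor(unclass(onset)), origin = "1970-01-01")

format(rounded)
format(floored)
```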

The censored line list dates can be used with packages such as {primarycensored}, which account for censoring when fitting delay distributions.

Under-reporting of cases and contacts

In this section we’ll show how case line lists and contact tracing data sets can be subset to represent under-reporting, a common feature of real-world outbreak data, especially in resource-limited settings.

In the line list each case is unlinked (i.e. information on each row is independent of information on every other row). This means we can remove rows in the line list without having to modify any remaining rows. We assume for this example that the probability of being reported, and thus included in the line list, is independent of case type, sex, age and the phase of the outbreak.

For this example we’ll assume the case reporting probability in the line list is 50%.

Tidyverse

linelist %>%
  filter(as.logical(rbinom(n(), size = 1, prob = 0.5)))
#>   id       case_name case_type sex age date_onset date_reporting date_admission
#> 1  1   Fabian Mrazik confirmed   m  90 2023-01-01     2023-01-01           <NA>
#> 2  3 Ashley Martinez confirmed   f  71 2023-01-01     2023-01-01     2023-01-08
#> 3  4          Tia Vu  probable   f  48 2023-01-01     2023-01-01           <NA>
#> 4  6  Courtney Flood suspected   f  83 2023-01-02     2023-01-02           <NA>
#> 5  7    Joseph Jiron suspected   m  56 2023-01-01     2023-01-01           <NA>
#> 6  8    Kevin Liddle suspected   m  39 2023-01-01     2023-01-01           <NA>
#> 7 21   Katlyn Nelson  probable   f  36 2023-01-02     2023-01-02           <NA>
#>     outcome date_outcome date_first_contact date_last_contact ct_value
#> 1 recovered         <NA>               <NA>              <NA>     21.9
#> 2      died   2023-01-10         2022-12-26        2023-01-06     22.7
#> 3 recovered         <NA>         2022-12-30        2023-01-05       NA
#> 4 recovered         <NA>         2022-12-26        2023-01-04       NA
#> 5 recovered         <NA>         2022-12-28        2023-01-03       NA
#> 6 recovered         <NA>         2022-12-31        2023-01-03       NA
#> 7 recovered         <NA>         2023-01-01        2023-01-03       NA

Base R

idx <- as.logical(rbinom(n = nrow(linelist), size = 1, prob = 0.5))
linelist[idx, ]
#>    id       case_name case_type sex age date_onset date_reporting
#> 2   3 Ashley Martinez confirmed   f  71 2023-01-01     2023-01-01
#> 3   4          Tia Vu  probable   f  48 2023-01-01     2023-01-01
#> 5   6  Courtney Flood suspected   f  83 2023-01-02     2023-01-02
#> 8   9 Rutaiba el-Raad confirmed   f  68 2023-01-01     2023-01-01
#> 9  10 Jaime Middleton suspected   m   1 2023-01-01     2023-01-01
#> 10 14     Emily Fyffe confirmed   f  16 2023-01-01     2023-01-01
#> 12 21   Katlyn Nelson  probable   f  36 2023-01-02     2023-01-02
#> 13 24 Nicholas Rentie suspected   m  49 2023-01-02     2023-01-02
#>    date_admission   outcome date_outcome date_first_contact date_last_contact
#> 2      2023-01-08      died   2023-01-10         2022-12-26        2023-01-06
#> 3            <NA> recovered         <NA>         2022-12-30        2023-01-05
#> 5            <NA> recovered         <NA>         2022-12-26        2023-01-04
#> 8            <NA> recovered         <NA>         2022-12-29        2023-01-01
#> 9            <NA> recovered         <NA>         2022-12-26        2023-01-02
#> 10     2023-01-02 recovered         <NA>         2022-12-30        2023-01-02
#> 12           <NA> recovered         <NA>         2023-01-01        2023-01-03
#> 13           <NA> recovered         <NA>         2022-12-28        2023-01-04
#>    ct_value
#> 2      22.7
#> 3        NA
#> 5        NA
#> 8      24.2
#> 9        NA
#> 10     21.3
#> 12       NA
#> 13       NA

The above examples randomly sample rows in the line list using the reporting probability, which results in a different number of cases being kept each time the code is run. To subset the line list data and get the same number of rows (i.e. cases) returned each time, slice_sample() can be used instead.

linelist %>%
  dplyr::slice_sample(prop = 0.5) %>%
  dplyr::arrange(id)
#>   id             case_name case_type sex age date_onset date_reporting
#> 1  1         Fabian Mrazik confirmed   m  90 2023-01-01     2023-01-01
#> 2  4                Tia Vu  probable   f  48 2023-01-01     2023-01-01
#> 3  5 Abdul Majeed el-Saleh confirmed   m  77 2023-01-01     2023-01-01
#> 4  6        Courtney Flood suspected   f  83 2023-01-02     2023-01-02
#> 5 21         Katlyn Nelson  probable   f  36 2023-01-02     2023-01-02
#> 6 24       Nicholas Rentie suspected   m  49 2023-01-02     2023-01-02
#>   date_admission   outcome date_outcome date_first_contact date_last_contact
#> 1           <NA> recovered         <NA>               <NA>              <NA>
#> 2           <NA> recovered         <NA>         2022-12-30        2023-01-05
#> 3           <NA> recovered         <NA>         2022-12-31        2023-01-08
#> 4           <NA> recovered         <NA>         2022-12-26        2023-01-04
#> 5           <NA> recovered         <NA>         2023-01-01        2023-01-03
#> 6           <NA> recovered         <NA>         2022-12-28        2023-01-04
#>   ct_value
#> 1     21.9
#> 2       NA
#> 3     27.4
#> 4       NA
#> 5       NA
#> 6       NA

slice_sample() can reorder rows, so we arrange by ID to restore the original row order of the line list.
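A base R version of the same idea is sketched below on a small made-up data frame (toy_linelist is hypothetical, not simulated with {simulist}). It draws a fixed number of row indices with sample.int() and sorts them so that the original row order is preserved.

```r
set.seed(1)

# hypothetical line list with only two columns
toy_linelist <- data.frame(
  id = c(1, 3, 4, 5, 6, 7, 8, 9),
  age = c(90, 71, 48, 77, 83, 56, 39, 68)
)

# keep a fixed 50% of rows (rounded down), preserving the original order
n_keep <- floor(0.5 * nrow(toy_linelist))
idx <- sort(sample.int(nrow(toy_linelist), size = n_keep))
toy_subset <- toy_linelist[idx, ]
toy_subset
```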

On to under-reporting in contact tracing data. Unlike line list data, contact tracing data is linked. The direction of contact and possibly transmission is recorded in the $from and $to columns.

For this example we will assume that the contact tracing under-reporting applies both to infected individuals and to contacts that were not infected. However, the same method could be applied for under-reporting of the transmission chain by first subsetting to infections only (see the vis-linelist.Rmd vignette for an example).

We plot the full contact network so it can be compared to the contact networks with under-reporting plotted below.

epicontacts <- make_epicontacts(
  linelist = linelist,
  contacts = contacts,
  id = "case_name",
  from = "from",
  to = "to",
  directed = TRUE
)
plot(epicontacts)

First we randomly sample who is not reported in the outbreak data. For this example we assume the pool of people that can be unreported is everyone in the contact network (infections and contacts), and assume a 50% reporting probability.

all_contacts <- unique(c(contacts$from, contacts$to))
not_reported <- sample(x = all_contacts, size = 0.5 * length(all_contacts))
not_reported
#>  [1] "Katlyn Nelson"      "Jin Fu"             "Breanna Hofbauer"  
#>  [4] "Ashley Martinez"    "Emily Abo"          "Forrest Anderson"  
#>  [7] "Miguel Oyebi"       "Shabeeba el-Younes" "Sarah Bridwell"    
#> [10] "Rutaiba el-Raad"    "Yvonne Howard"      "Kevin Liddle"      
#> [13] "Nicholas Rentie"

Next we subset the contact tracing data by removing rows in which the person contacted ($to) is not reported. Because the contact tracing data is linked across rows, we also need to set $from to NA in any remaining rows where an unreported person is recorded as the source of the contact or infection.

# make copy of contact tracing data for under-reporting
contacts_ur <- contacts
for (person in not_reported) {
  contacts_ur <- contacts_ur[contacts_ur$to != person, ]
  contacts_ur[contacts_ur$from %in% person, "from"] <- NA
}
head(contacts_ur)
#>                     from                    to age sex date_first_contact
#> 3          Fabian Mrazik                Tia Vu  48   f         2022-12-30
#> 4          Fabian Mrazik Abdul Majeed el-Saleh  77   m         2022-12-31
#> 5                   <NA>        Courtney Flood  83   f         2022-12-26
#> 6                   <NA>          Joseph Jiron  56   m         2022-12-28
#> 9  Abdul Majeed el-Saleh       Jaime Middleton   1   m         2022-12-26
#> 13                  <NA>           Emily Fyffe  16   f         2022-12-30
#>    date_last_contact was_case status
#> 3         2023-01-05     TRUE   case
#> 4         2023-01-08     TRUE   case
#> 5         2023-01-04     TRUE   case
#> 6         2023-01-03     TRUE   case
#> 9         2023-01-02     TRUE   case
#> 13        2023-01-02     TRUE   case
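The loop above can also be written without iteration. The sketch below shows a vectorised equivalent on a small made-up contact table (toy_contacts and the single-letter names are hypothetical); it should give the same rows as looping over each unreported person.

```r
# hypothetical contact tracing data with only the linking columns
toy_contacts <- data.frame(
  from = c("A", "A", "B", "B", "C"),
  to = c("B", "C", "D", "E", "F"),
  stringsAsFactors = FALSE
)
toy_not_reported <- c("B", "F")

# drop rows where the person contacted is unreported
toy_contacts_ur <- toy_contacts[!toy_contacts$to %in% toy_not_reported, ]

# blank out the source of contact where that person is unreported
toy_contacts_ur$from[toy_contacts_ur$from %in% toy_not_reported] <- NA
toy_contacts_ur
```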

We can plot this new contact network with {epicontacts}. We’ll need to subset the line list to exclude the same unreported cases.

linelist_ur <- linelist[!linelist$case_name %in% not_reported, ]
epicontacts <- make_epicontacts(
  linelist = linelist_ur,
  contacts = contacts_ur,
  id = "case_name",
  from = "from",
  to = "to",
  directed = TRUE
)
plot(epicontacts)

The above example can be thought of as resulting from incomplete recording or recall of contacts. A second method for under-reporting of contact tracing data is to assume that if a case is unreported then all of the cases and contacts stemming from the unreported case are lost.

For this example we’ll sample a single individual not to report and then prune all cases and contacts from that individual in the network.

all_contacts <- unique(c(contacts$from, contacts$to))
not_reported <- sample(x = all_contacts, size = 1)
not_reported
#> [1] "Abdul Majeed el-Saleh"

Then we can recursively prune all cases and contacts that stem from this individual (this can be zero if the person had no secondary cases or contacts).

# make copy of contact tracing data for under-reporting
contacts_ur <- contacts
while (length(not_reported) > 0) {
  contacts_ur <- contacts_ur[!contacts_ur$to %in% not_reported, ]
  not_reported_ <- contacts_ur$to[contacts_ur$from %in% not_reported]
  contacts_ur <- contacts_ur[!contacts_ur$from %in% not_reported, ]
  not_reported <- not_reported_
}
head(contacts_ur)
#>              from              to age sex date_first_contact date_last_contact
#> 1   Fabian Mrazik   Yvonne Howard   9   f         2022-12-31        2023-01-05
#> 2   Fabian Mrazik Ashley Martinez  71   f         2022-12-26        2023-01-06
#> 3   Fabian Mrazik          Tia Vu  48   f         2022-12-30        2023-01-05
#> 5 Ashley Martinez  Courtney Flood  83   f         2022-12-26        2023-01-04
#> 6 Ashley Martinez    Joseph Jiron  56   m         2022-12-28        2023-01-03
#> 7          Tia Vu    Kevin Liddle  39   m         2022-12-31        2023-01-03
#>   was_case         status
#> 1    FALSE under_followup
#> 2     TRUE           case
#> 3     TRUE           case
#> 5     TRUE           case
#> 6     TRUE           case
#> 7     TRUE           case
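To make the pruning logic easier to follow, the sketch below wraps the same loop in a function and applies it to a small made-up network. The function name prune_descendants() and the toy data are hypothetical and are not part of {simulist}.

```r
# remove `start` and everyone reached through `start` from a contact table
prune_descendants <- function(contacts, start) {
  not_reported <- start
  while (length(not_reported) > 0) {
    # drop rows where an unreported person is the one contacted
    contacts <- contacts[!contacts$to %in% not_reported, ]
    # the next generation to prune: everyone they contacted
    next_generation <- contacts$to[contacts$from %in% not_reported]
    # drop rows where an unreported person is the source
    contacts <- contacts[!contacts$from %in% not_reported, ]
    not_reported <- next_generation
  }
  contacts
}

# hypothetical network: A -> B, A -> C, B -> D, B -> E, D -> G, C -> F
toy_contacts <- data.frame(
  from = c("A", "A", "B", "B", "D", "C"),
  to = c("B", "C", "D", "E", "G", "F"),
  stringsAsFactors = FALSE
)

# pruning B also removes D, E and G, leaving only A -> C and C -> F
pruned <- prune_descendants(toy_contacts, start = "B")
pruned
```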

Just as above we can plot the new contact network using {epicontacts}.

# subset line list to match under-reporting in contact tracing data
# (keep only cases that still appear in the pruned contact tracing data)
linelist_ur <- linelist[
  linelist$case_name %in% unique(c(contacts_ur$from, contacts_ur$to)),
]

epicontacts <- make_epicontacts(
  linelist = linelist_ur,
  contacts = contacts_ur,
  id = "case_name",
  from = "from",
  to = "to",
  directed = TRUE
)
plot(epicontacts)

More complex under-reporting schemes can depend on covariates in the line list and contact tracing data. Examples include $case_type in the line list, with suspected cases most likely to go unreported, or $status in the contact tracing data, with unknown or lost_to_followup contacts more likely to be under-reported.
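As a minimal sketch of covariate-dependent under-reporting, the example below assigns a different reporting probability to each case type. The data frame toy_linelist and the probabilities in report_prob are made up for illustration and are not estimates from any real outbreak.

```r
set.seed(1)

# hypothetical line list with a case type column
toy_linelist <- data.frame(
  id = 1:6,
  case_type = c(
    "confirmed", "probable", "suspected",
    "suspected", "confirmed", "probable"
  ),
  stringsAsFactors = FALSE
)

# assumed reporting probability per case type (illustrative values only)
report_prob <- c(confirmed = 0.9, probable = 0.7, suspected = 0.4)

# look up each case's reporting probability and draw one Bernoulli trial each
case_prob <- report_prob[toy_linelist$case_type]
reported <- as.logical(
  rbinom(n = nrow(toy_linelist), size = 1, prob = case_prob)
)
toy_linelist[reported, ]
```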

Removing a line list column

Not every column in the simulated line list may be required for the use case at hand. In this example we will remove the $ct_value column. This could be useful, for instance, to simulate an outbreak for which no laboratory testing (e.g. polymerase chain reaction (PCR) testing) was available, and thus a cycle threshold (Ct) value would not be known for confirmed cases.

Tidyverse

# remove column by name
linelist %>% # nolint one_call_pipe_linter
  select(!ct_value)
#>    id             case_name case_type sex age date_onset date_reporting
#> 1   1         Fabian Mrazik confirmed   m  90 2023-01-01     2023-01-01
#> 2   3       Ashley Martinez confirmed   f  71 2023-01-01     2023-01-01
#> 3   4                Tia Vu  probable   f  48 2023-01-01     2023-01-01
#> 4   5 Abdul Majeed el-Saleh confirmed   m  77 2023-01-01     2023-01-01
#> 5   6        Courtney Flood suspected   f  83 2023-01-02     2023-01-02
#> 6   7          Joseph Jiron suspected   m  56 2023-01-01     2023-01-01
#> 7   8          Kevin Liddle suspected   m  39 2023-01-01     2023-01-01
#> 8   9       Rutaiba el-Raad confirmed   f  68 2023-01-01     2023-01-01
#> 9  10       Jaime Middleton suspected   m   1 2023-01-01     2023-01-01
#> 10 14           Emily Fyffe confirmed   f  16 2023-01-01     2023-01-01
#> 11 16          Miguel Oyebi confirmed   m  54 2023-01-02     2023-01-02
#> 12 21         Katlyn Nelson  probable   f  36 2023-01-02     2023-01-02
#> 13 24       Nicholas Rentie suspected   m  49 2023-01-02     2023-01-02
#>    date_admission   outcome date_outcome date_first_contact date_last_contact
#> 1            <NA> recovered         <NA>               <NA>              <NA>
#> 2      2023-01-08      died   2023-01-10         2022-12-26        2023-01-06
#> 3            <NA> recovered         <NA>         2022-12-30        2023-01-05
#> 4            <NA> recovered         <NA>         2022-12-31        2023-01-08
#> 5            <NA> recovered         <NA>         2022-12-26        2023-01-04
#> 6            <NA> recovered         <NA>         2022-12-28        2023-01-03
#> 7            <NA> recovered         <NA>         2022-12-31        2023-01-03
#> 8            <NA> recovered         <NA>         2022-12-29        2023-01-01
#> 9            <NA> recovered         <NA>         2022-12-26        2023-01-02
#> 10     2023-01-02 recovered         <NA>         2022-12-30        2023-01-02
#> 11     2023-01-05 recovered         <NA>         2022-12-30        2023-01-05
#> 12           <NA> recovered         <NA>         2023-01-01        2023-01-03
#> 13           <NA> recovered         <NA>         2022-12-28        2023-01-04

Base R

# remove column by numeric column indexing
# ct_value is column 13 (the last column)
linelist[, -13]
#>    id             case_name case_type sex age date_onset date_reporting
#> 1   1         Fabian Mrazik confirmed   m  90 2023-01-01     2023-01-01
#> 2   3       Ashley Martinez confirmed   f  71 2023-01-01     2023-01-01
#> 3   4                Tia Vu  probable   f  48 2023-01-01     2023-01-01
#> 4   5 Abdul Majeed el-Saleh confirmed   m  77 2023-01-01     2023-01-01
#> 5   6        Courtney Flood suspected   f  83 2023-01-02     2023-01-02
#> 6   7          Joseph Jiron suspected   m  56 2023-01-01     2023-01-01
#> 7   8          Kevin Liddle suspected   m  39 2023-01-01     2023-01-01
#> 8   9       Rutaiba el-Raad confirmed   f  68 2023-01-01     2023-01-01
#> 9  10       Jaime Middleton suspected   m   1 2023-01-01     2023-01-01
#> 10 14           Emily Fyffe confirmed   f  16 2023-01-01     2023-01-01
#> 11 16          Miguel Oyebi confirmed   m  54 2023-01-02     2023-01-02
#> 12 21         Katlyn Nelson  probable   f  36 2023-01-02     2023-01-02
#> 13 24       Nicholas Rentie suspected   m  49 2023-01-02     2023-01-02
#>    date_admission   outcome date_outcome date_first_contact date_last_contact
#> 1            <NA> recovered         <NA>               <NA>              <NA>
#> 2      2023-01-08      died   2023-01-10         2022-12-26        2023-01-06
#> 3            <NA> recovered         <NA>         2022-12-30        2023-01-05
#> 4            <NA> recovered         <NA>         2022-12-31        2023-01-08
#> 5            <NA> recovered         <NA>         2022-12-26        2023-01-04
#> 6            <NA> recovered         <NA>         2022-12-28        2023-01-03
#> 7            <NA> recovered         <NA>         2022-12-31        2023-01-03
#> 8            <NA> recovered         <NA>         2022-12-29        2023-01-01
#> 9            <NA> recovered         <NA>         2022-12-26        2023-01-02
#> 10     2023-01-02 recovered         <NA>         2022-12-30        2023-01-02
#> 11     2023-01-05 recovered         <NA>         2022-12-30        2023-01-05
#> 12           <NA> recovered         <NA>         2023-01-01        2023-01-03
#> 13           <NA> recovered         <NA>         2022-12-28        2023-01-04

# remove column by column name
linelist[, colnames(linelist) != "ct_value"]
#>    id             case_name case_type sex age date_onset date_reporting
#> 1   1         Fabian Mrazik confirmed   m  90 2023-01-01     2023-01-01
#> 2   3       Ashley Martinez confirmed   f  71 2023-01-01     2023-01-01
#> 3   4                Tia Vu  probable   f  48 2023-01-01     2023-01-01
#> 4   5 Abdul Majeed el-Saleh confirmed   m  77 2023-01-01     2023-01-01
#> 5   6        Courtney Flood suspected   f  83 2023-01-02     2023-01-02
#> 6   7          Joseph Jiron suspected   m  56 2023-01-01     2023-01-01
#> 7   8          Kevin Liddle suspected   m  39 2023-01-01     2023-01-01
#> 8   9       Rutaiba el-Raad confirmed   f  68 2023-01-01     2023-01-01
#> 9  10       Jaime Middleton suspected   m   1 2023-01-01     2023-01-01
#> 10 14           Emily Fyffe confirmed   f  16 2023-01-01     2023-01-01
#> 11 16          Miguel Oyebi confirmed   m  54 2023-01-02     2023-01-02
#> 12 21         Katlyn Nelson  probable   f  36 2023-01-02     2023-01-02
#> 13 24       Nicholas Rentie suspected   m  49 2023-01-02     2023-01-02
#>    date_admission   outcome date_outcome date_first_contact date_last_contact
#> 1            <NA> recovered         <NA>               <NA>              <NA>
#> 2      2023-01-08      died   2023-01-10         2022-12-26        2023-01-06
#> 3            <NA> recovered         <NA>         2022-12-30        2023-01-05
#> 4            <NA> recovered         <NA>         2022-12-31        2023-01-08
#> 5            <NA> recovered         <NA>         2022-12-26        2023-01-04
#> 6            <NA> recovered         <NA>         2022-12-28        2023-01-03
#> 7            <NA> recovered         <NA>         2022-12-31        2023-01-03
#> 8            <NA> recovered         <NA>         2022-12-29        2023-01-01
#> 9            <NA> recovered         <NA>         2022-12-26        2023-01-02
#> 10     2023-01-02 recovered         <NA>         2022-12-30        2023-01-02
#> 11     2023-01-05 recovered         <NA>         2022-12-30        2023-01-05
#> 12           <NA> recovered         <NA>         2023-01-01        2023-01-03
#> 13           <NA> recovered         <NA>         2022-12-28        2023-01-04

# remove column by assigning it to NULL
linelist$ct_value <- NULL
linelist
#>    id             case_name case_type sex age date_onset date_reporting
#> 1   1         Fabian Mrazik confirmed   m  90 2023-01-01     2023-01-01
#> 2   3       Ashley Martinez confirmed   f  71 2023-01-01     2023-01-01
#> 3   4                Tia Vu  probable   f  48 2023-01-01     2023-01-01
#> 4   5 Abdul Majeed el-Saleh confirmed   m  77 2023-01-01     2023-01-01
#> 5   6        Courtney Flood suspected   f  83 2023-01-02     2023-01-02
#> 6   7          Joseph Jiron suspected   m  56 2023-01-01     2023-01-01
#> 7   8          Kevin Liddle suspected   m  39 2023-01-01     2023-01-01
#> 8   9       Rutaiba el-Raad confirmed   f  68 2023-01-01     2023-01-01
#> 9  10       Jaime Middleton suspected   m   1 2023-01-01     2023-01-01
#> 10 14           Emily Fyffe confirmed   f  16 2023-01-01     2023-01-01
#> 11 16          Miguel Oyebi confirmed   m  54 2023-01-02     2023-01-02
#> 12 21         Katlyn Nelson  probable   f  36 2023-01-02     2023-01-02
#> 13 24       Nicholas Rentie suspected   m  49 2023-01-02     2023-01-02
#>    date_admission   outcome date_outcome date_first_contact date_last_contact
#> 1            <NA> recovered         <NA>               <NA>              <NA>
#> 2      2023-01-08      died   2023-01-10         2022-12-26        2023-01-06
#> 3            <NA> recovered         <NA>         2022-12-30        2023-01-05
#> 4            <NA> recovered         <NA>         2022-12-31        2023-01-08
#> 5            <NA> recovered         <NA>         2022-12-26        2023-01-04
#> 6            <NA> recovered         <NA>         2022-12-28        2023-01-03
#> 7            <NA> recovered         <NA>         2022-12-31        2023-01-03
#> 8            <NA> recovered         <NA>         2022-12-29        2023-01-01
#> 9            <NA> recovered         <NA>         2022-12-26        2023-01-02
#> 10     2023-01-02 recovered         <NA>         2022-12-30        2023-01-02
#> 11     2023-01-05 recovered         <NA>         2022-12-30        2023-01-05
#> 12           <NA> recovered         <NA>         2023-01-01        2023-01-03
#> 13           <NA> recovered         <NA>         2022-12-28        2023-01-04
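
The name-based approach also extends to removing several columns at once. The sketch below uses a small made-up data frame (toy_linelist and its values are hypothetical) and setdiff() to keep every column not listed in drop_cols.

```r
# hypothetical line list with four columns
toy_linelist <- data.frame(
  id = 1:3,
  age = c(30, 45, 60),
  outcome = c("recovered", "died", "recovered"),
  ct_value = c(21.9, NA, 27.4),
  stringsAsFactors = FALSE
)

# names of the columns to remove
drop_cols <- c("ct_value", "outcome")

# keep all other columns; drop = FALSE guards against a single remaining
# column being simplified to a vector
toy_reduced <- toy_linelist[
  , setdiff(colnames(toy_linelist), drop_cols), drop = FALSE
]
toy_reduced
```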