
Convert long to wide dataset using data.table::dcast or tidyr

Given the following data in long format, I would like to reshape it to wide format. The solution should work for an arbitrary number of timepoints.

    dat <- structure(list(srdr_id = c("172507", "172507", "172507", "172507", 
"172619", "172619", "172619", "172619"), arm = c("CBT_Educ", 
"CBT_MI", "CBT_Educ", "CBT_MI", "MI", "Educ", "MI", "Educ"), 
    timepoint = c(0, 0, 3, 3, 0, 0, 3, 3), n = c(102, 103, 100, 
    101, 58, 61, 45, 53), mean = c(37.69, 40.23, 34.53, 31.8, 
    4.6, 4.3, 4.4, 4.1), sd = c(16.06, 14.23, 19.78, 19.67, 2.2, 
    2.2, 2.3, 2.5)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-8L))

Long dataset:

  srdr_id arm      timepoint     n  mean    sd
  <chr>   <chr>        <dbl> <dbl> <dbl> <dbl>
1 172507  CBT_Educ         0   102  37.7  16.1
2 172507  CBT_MI           0   103  40.2  14.2
3 172507  CBT_Educ         3   100  34.5  19.8
4 172507  CBT_MI           3   101  31.8  19.7
5 172619  MI               0    58   4.6   2.2
6 172619  Educ             0    61   4.3   2.2
7 172619  MI               3    45   4.4   2.3
8 172619  Educ             3    53   4.1   2.5

I would like to create a wide dataset in which each combination of srdr_id and arm occupies a single row, with the three variables (n, mean and sd) repeated once per timepoint:

  srdr_id arm        n.0 mean.0  sd.0   n.3 mean.3  sd.3
1 172507  CBT_Educ   102   37.7  16.1   100   34.5  19.8
2 172507  CBT_MI     103   40.2  14.2   101   31.8  19.7
5 172619  MI          58    4.6   2.2    45    4.4   2.3
6 172619  Educ        61    4.3   2.2    53    4.1   2.5

The following attempt:

data.table::dcast(data = dat, srdr_id + arm, value.var = c(n_analyzed, mean, sd))

failed with:

Error in is.formula(formula) : object 'srdr_id' not found

A common workflow for this kind of reshaping is to gather all the metrics into a single column, append the timepoint to each metric name, and then spread the result out again. See below:

tidyverse:

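library(dplyr)
library(tidyr)

dat %>%
  gather("measure", "val", n, mean, sd) %>%             # stack n, mean, sd into one column
  mutate(measure = paste0(measure, ".", timepoint)) %>%  # e.g. "n" becomes "n.0"
  select(-timepoint) %>%
  spread(measure, val)                                   # one column per measure.timepoint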

# A tibble: 4 x 8
  srdr_id arm      mean.0 mean.3   n.0   n.3  sd.0  sd.3
  <chr>   <chr>     <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>
1 172507  CBT_Educ   37.7   34.5   102   100  16.1  19.8
2 172507  CBT_MI     40.2   31.8   103   101  14.2  19.7
3 172619  Educ        4.3    4.1    61    53   2.2   2.5
4 172619  MI          4.6    4.4    58    45   2.2   2.3
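
In current versions of tidyr, gather() and spread() are superseded by pivot_longer() and pivot_wider(). As a minimal sketch, assuming tidyr 1.0.0 or later is installed, the same reshape can be written as a single pivot_wider() call, which handles any number of timepoints without the intermediate renaming step. The data frame below is the question's data rebuilt with data.frame(), and the name `wide` is only illustrative.

```r
library(tidyr)

# The question's data, rebuilt as a plain data frame
dat <- data.frame(
  srdr_id   = rep(c("172507", "172619"), each = 4),
  arm       = c("CBT_Educ", "CBT_MI", "CBT_Educ", "CBT_MI",
                "MI", "Educ", "MI", "Educ"),
  timepoint = c(0, 0, 3, 3, 0, 0, 3, 3),
  n         = c(102, 103, 100, 101, 58, 61, 45, 53),
  mean      = c(37.69, 40.23, 34.53, 31.8, 4.6, 4.3, 4.4, 4.1),
  sd        = c(16.06, 14.23, 19.78, 19.67, 2.2, 2.2, 2.3, 2.5),
  stringsAsFactors = FALSE
)

# One row per srdr_id/arm; one column per variable/timepoint pair,
# named "<variable>.<timepoint>" (n.0, n.3, mean.0, ...)
wide <- pivot_wider(
  dat,
  id_cols     = c(srdr_id, arm),
  names_from  = timepoint,
  values_from = c(n, mean, sd),
  names_sep   = "."
)

print(wide)
```

Unlike the spread() version, the columns come out grouped by variable rather than sorted alphabetically.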

data.table:

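library(data.table)
library(magrittr)  # provides %>%

dt <- as.data.table(dat)

# melt n, mean and sd into variable/value, append the timepoint to the
# variable name, drop timepoint, then cast back to wide
melt(dt, id.vars = c("srdr_id", "arm", "timepoint"))[
  ,`:=`(variable = paste0(variable, ".", timepoint), timepoint = NULL)
  ] %>%
  dcast(srdr_id + arm ~ variable, value.var = "value")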

   srdr_id      arm mean.0 mean.3 n.0 n.3  sd.0  sd.3
1:  172507 CBT_Educ  37.69  34.53 102 100 16.06 19.78
2:  172507   CBT_MI  40.23  31.80 103 101 14.23 19.67
3:  172619     Educ   4.30   4.10  61  53  2.20  2.50
4:  172619       MI   4.60   4.40  58  45  2.20  2.30
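
The attempt in the question fails for two reasons: the formula has no `~` separating the row identifiers from the column variable, and value.var expects a character vector of column names (n_analyzed is also not a column in dat). data.table's dcast() accepts several value.var columns at once, so the melt step can be skipped. The following is a sketch rather than a definitive fix, assuming data.table 1.9.6 or later; the data is the question's, rebuilt with data.table(), and `wide` is an illustrative name.

```r
library(data.table)

# The question's data as a data.table
dt <- data.table(
  srdr_id   = rep(c("172507", "172619"), each = 4),
  arm       = c("CBT_Educ", "CBT_MI", "CBT_Educ", "CBT_MI",
                "MI", "Educ", "MI", "Educ"),
  timepoint = c(0, 0, 3, 3, 0, 0, 3, 3),
  n         = c(102, 103, 100, 101, 58, 61, 45, 53),
  mean      = c(37.69, 40.23, 34.53, 31.8, 4.6, 4.3, 4.4, 4.1),
  sd        = c(16.06, 14.23, 19.78, 19.67, 2.2, 2.2, 2.3, 2.5)
)

# Rows: srdr_id + arm; columns: one per value.var/timepoint pair.
# sep = "." gives names such as n.0 and sd.3 (the default separator is "_").
wide <- dcast(
  dt,
  srdr_id + arm ~ timepoint,
  value.var = c("n", "mean", "sd"),
  sep = "."
)

print(wide)
```

Because the column names are built from whatever values timepoint takes, this form should extend to additional timepoints without changes.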

One alternative (probably not the most elegant) is to use group_by() and summarise() from the dplyr package. Here you don't need to calculate anything, since all the values are already in your initial dataset, so you can use functions like first() and last() to specify which values you want.

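library(dplyr)

dat %>% 
  group_by(srdr_id, arm) %>% 
  summarise(
    # first() picks the timepoint 0 row and last() the timepoint 3 row,
    # given that rows within each group are ordered by timepoint
    n0 = first(n),     mean0 = first(mean),    sd0 = first(sd), 
    n3 = last(n),      mean3 = last(mean),     sd3 = last(sd)
  )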

#   srdr_id arm         n0 mean0   sd0    n3 mean3   sd3
#   <chr>   <chr>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 172507  CBT_Educ   102  37.7  16.1   100  34.5  19.8
# 2 172507  CBT_MI     103  40.2  14.2   101  31.8  19.7
# 3 172619  Educ        61   4.3   2.2    53   4.1   2.5
# 4 172619  MI          58   4.6   2.2    45   4.4   2.3
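
Note that first() and last() depend on the row order within each group, and the column list is hard-coded for exactly two timepoints. As a hedged variation, the values can instead be selected by timepoint explicitly, which removes the dependence on row order (though not the hard-coding). This sketch assumes dplyr 1.0.0 or later for the `.groups` argument; `wide` is an illustrative name and the data is the question's, rebuilt with data.frame().

```r
library(dplyr)

# The question's data, rebuilt as a plain data frame
dat <- data.frame(
  srdr_id   = rep(c("172507", "172619"), each = 4),
  arm       = c("CBT_Educ", "CBT_MI", "CBT_Educ", "CBT_MI",
                "MI", "Educ", "MI", "Educ"),
  timepoint = c(0, 0, 3, 3, 0, 0, 3, 3),
  n         = c(102, 103, 100, 101, 58, 61, 45, 53),
  mean      = c(37.69, 40.23, 34.53, 31.8, 4.6, 4.3, 4.4, 4.1),
  sd        = c(16.06, 14.23, 19.78, 19.67, 2.2, 2.2, 2.3, 2.5),
  stringsAsFactors = FALSE
)

wide <- dat %>%
  group_by(srdr_id, arm) %>%
  summarise(
    # Select each value by its timepoint instead of by row position
    n0 = n[timepoint == 0], mean0 = mean[timepoint == 0], sd0 = sd[timepoint == 0],
    n3 = n[timepoint == 3], mean3 = mean[timepoint == 3], sd3 = sd[timepoint == 3],
    .groups = "drop"
  )

print(wide)
```

For an arbitrary number of timepoints, a reshape function such as dcast() or the tidyr pivoting functions is likely the better fit, since every new timepoint here would need its own set of summarise() arguments.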
