Convert long to wide dataset using data.table::dcast or tidyr
Given the following data in long format, I would like to reshape it to wide format in a way that works for an arbitrary number of timepoints.
dat <- structure(list(srdr_id = c("172507", "172507", "172507", "172507",
"172619", "172619", "172619", "172619"), arm = c("CBT_Educ",
"CBT_MI", "CBT_Educ", "CBT_MI", "MI", "Educ", "MI", "Educ"),
timepoint = c(0, 0, 3, 3, 0, 0, 3, 3), n = c(102, 103, 100,
101, 58, 61, 45, 53), mean = c(37.69, 40.23, 34.53, 31.8,
4.6, 4.3, 4.4, 4.1), sd = c(16.06, 14.23, 19.78, 19.67, 2.2,
2.2, 2.3, 2.5)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-8L))
Long dataset:
  srdr_id arm      timepoint     n  mean    sd
  <chr>   <chr>        <dbl> <dbl> <dbl> <dbl>
1 172507  CBT_Educ         0   102  37.7  16.1
2 172507  CBT_MI           0   103  40.2  14.2
3 172507  CBT_Educ         3   100  34.5  19.8
4 172507  CBT_MI           3   101  31.8  19.7
5 172619  MI               0    58   4.6   2.2
6 172619  Educ             0    61   4.3   2.2
7 172619  MI               3    45   4.4   2.3
8 172619  Educ             3    53   4.1   2.5
I would like to create a wide dataset, such that within each srdr_id and arm the three variables (n, mean and sd) for every timepoint appear in the same row:
  srdr_id arm        n.0 mean.0  sd.0   n.3 mean.3  sd.3
1 172507  CBT_Educ   102   37.7  16.1   100   34.5  19.8
2 172507  CBT_MI     103   40.2  14.2   101   31.8  19.7
5 172619  MI          58    4.6   2.2    45    4.4   2.3
6 172619  Educ        61    4.3   2.2    53    4.1   2.5
The following failed with:
Error in is.formula(formula) : object 'srdr_id' not found
data.table::dcast(data = dat, srdr_id + arm, value.var = c(n_analyzed, mean, sd))
A common workflow for this type of situation is to gather all the metrics into one column, rename them so that each name carries its timepoint, and then spread them out again:
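library(dplyr)
library(tidyr)

dat %>%
  gather("measure", "val", n, mean, sd) %>%              # stack n, mean, sd into key/value pairs
  mutate(measure = paste0(measure, ".", timepoint)) %>%  # e.g. "n" becomes "n.0" or "n.3"
  select(-timepoint) %>%
  spread(measure, val)                                   # one column per measure.timepoint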
# A tibble: 4 x 8
  srdr_id arm      mean.0 mean.3   n.0   n.3  sd.0  sd.3
  <chr>   <chr>     <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>
1 172507  CBT_Educ   37.7   34.5   102   100  16.1  19.8
2 172507  CBT_MI     40.2   31.8   103   101  14.2  19.7
3 172619  Educ        4.3    4.1    61    53   2.2   2.5
4 172619  MI          4.6    4.4    58    45   2.2   2.3
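The gather / rename / spread steps above can probably be collapsed into a single call on newer tidyr versions. The following is only a sketch, assuming tidyr 1.0.0 or later (where pivot_wider() is available); the data frame is rebuilt inline with the same values as dat so the snippet runs on its own.

```r
library(tidyr)

# Same values as `dat` in the question, rebuilt so the example is self-contained
dat <- data.frame(
  srdr_id   = rep(c("172507", "172619"), each = 4),
  arm       = c("CBT_Educ", "CBT_MI", "CBT_Educ", "CBT_MI",
                "MI", "Educ", "MI", "Educ"),
  timepoint = c(0, 0, 3, 3, 0, 0, 3, 3),
  n         = c(102, 103, 100, 101, 58, 61, 45, 53),
  mean      = c(37.69, 40.23, 34.53, 31.8, 4.6, 4.3, 4.4, 4.1),
  sd        = c(16.06, 14.23, 19.78, 19.67, 2.2, 2.2, 2.3, 2.5),
  stringsAsFactors = FALSE
)

# Each value column is spread across the timepoints; new names are
# "<value column><names_sep><timepoint>", e.g. "n.0", "mean.3".
# Columns not named here (srdr_id, arm) identify the rows.
wide <- pivot_wider(
  dat,
  names_from  = timepoint,
  values_from = c(n, mean, sd),
  names_sep   = "."
)

print(wide)
```

Because the new column names are derived from whatever values occur in timepoint, this should extend to any number of timepoints without changes. If the columns need to be grouped by timepoint rather than by measure, tidyr 1.2.0 and later offer a names_vary argument for that.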
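library(data.table)
library(magrittr)  # provides %>%

dt <- as.data.table(dat)
# Same workflow with data.table: melt to long, build "measure.timepoint" names, cast back to wide
melt(dt, id.vars = c("srdr_id", "arm", "timepoint"))[
  , `:=`(variable = paste0(variable, ".", timepoint), timepoint = NULL)
] %>%
  dcast(srdr_id + arm ~ variable, value.var = "value")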
   srdr_id      arm mean.0 mean.3 n.0 n.3  sd.0  sd.3
1:  172507 CBT_Educ  37.69  34.53 102 100 16.06 19.78
2:  172507   CBT_MI  40.23  31.80 103 101 14.23 19.67
3:  172619     Educ   4.30   4.10  61  53  2.20  2.50
4:  172619       MI   4.60   4.40  58  45  2.20  2.30
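The call in the question appears to fail for two reasons: the second argument is not a formula (there is no ~ separating the row identifiers from the column to spread), and value.var is given bare names instead of a character vector (n_analyzed is also not a column of dat; the column is called n). A minimal sketch of a corrected call follows, assuming a data.table version that accepts several value.var columns (1.9.6 or later); the data is rebuilt inline with the same values as dat.

```r
library(data.table)

# Same values as `dat` in the question, rebuilt as a data.table
dt <- data.table(
  srdr_id   = rep(c("172507", "172619"), each = 4),
  arm       = c("CBT_Educ", "CBT_MI", "CBT_Educ", "CBT_MI",
                "MI", "Educ", "MI", "Educ"),
  timepoint = c(0, 0, 3, 3, 0, 0, 3, 3),
  n         = c(102, 103, 100, 101, 58, 61, 45, 53),
  mean      = c(37.69, 40.23, 34.53, 31.8, 4.6, 4.3, 4.4, 4.1),
  sd        = c(16.06, 14.23, 19.78, 19.67, 2.2, 2.2, 2.3, 2.5)
)

# Left of ~ : columns that identify a row in the wide result
# Right of ~: column whose values become the name suffixes
# value.var : character vector of the columns to spread
wide <- dcast(
  dt,
  srdr_id + arm ~ timepoint,
  value.var = c("n", "mean", "sd"),
  sep = "."
)

print(wide)
```

With several value.var columns no melt step is needed, and the result has one column per measure and timepoint (n.0, n.3, mean.0, and so on) for however many timepoints occur in the data.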
One alternative (probably not the most elegant) is to use group_by() and summarise() from the dplyr package. Here you don't have to compute anything (all values are already in your initial dataset), so you can use functions like first() and last() to pick the values you want. Note that this relies on the rows being ordered by timepoint within each group, and it handles exactly two timepoints rather than an arbitrary number.
library(dplyr)

dat %>%
  group_by(srdr_id, arm) %>%
  summarise(
    # first() picks the timepoint-0 row, last() the timepoint-3 row of each group
    n0 = first(n), mean0 = first(mean), sd0 = first(sd),
    n3 = last(n), mean3 = last(mean), sd3 = last(sd)
  )
#   srdr_id arm         n0 mean0   sd0    n3 mean3   sd3
#   <chr>   <chr>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 172507  CBT_Educ   102  37.7  16.1   100  34.5  19.8
# 2 172507  CBT_MI     103  40.2  14.2   101  31.8  19.7
# 3 172619  Educ        61   4.3   2.2    53   4.1   2.5
# 4 172619  MI          58   4.6   2.2    45   4.4   2.3
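Since first() and last() depend on row order, one way to make the same idea order-independent is to index each column by the timepoint value explicitly. This is only a sketch: it still hard-codes the two timepoints 0 and 3, it assumes each group has exactly one row per timepoint, and the .groups argument assumes dplyr 1.0.0 or later. The rows are deliberately reversed here to show that the order no longer matters.

```r
library(dplyr)

# Same values as `dat` in the question, rebuilt so the example is self-contained
dat <- data.frame(
  srdr_id   = rep(c("172507", "172619"), each = 4),
  arm       = c("CBT_Educ", "CBT_MI", "CBT_Educ", "CBT_MI",
                "MI", "Educ", "MI", "Educ"),
  timepoint = c(0, 0, 3, 3, 0, 0, 3, 3),
  n         = c(102, 103, 100, 101, 58, 61, 45, 53),
  mean      = c(37.69, 40.23, 34.53, 31.8, 4.6, 4.3, 4.4, 4.1),
  sd        = c(16.06, 14.23, 19.78, 19.67, 2.2, 2.2, 2.3, 2.5),
  stringsAsFactors = FALSE
)

# Reverse the rows: first()/last() would now swap the two timepoints
shuffled <- dat[rev(seq_len(nrow(dat))), ]

wide <- shuffled %>%
  group_by(srdr_id, arm) %>%
  summarise(
    # select by timepoint value instead of by position within the group
    n0 = n[timepoint == 0], mean0 = mean[timepoint == 0], sd0 = sd[timepoint == 0],
    n3 = n[timepoint == 3], mean3 = mean[timepoint == 3], sd3 = sd[timepoint == 3],
    .groups = "drop"
  )

print(wide)
```

For more than a couple of timepoints, writing one line per timepoint quickly becomes unwieldy, so a reshaping function is likely the better fit.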