[英]Reshape long to wide with dcast
I need to reshape a dataframe csf_antibio (sample below) from long to wide format, separating all values per patient by row.我需要将 dataframe csf_antibio (下面的示例)从长格式重塑为宽格式,按行分隔每个患者的所有值。
study_id,proanbx_dt,proanbx_tm,name,othername,route,dosage,units,doses,freq
CHLA_0001,2021-07-22,20:01:00,ceftriaxone,,IV,1250,mg,4,13
CHLA_0001,2021-07-22,20:19:00,metronidazole,,IV,250,mg,5,9
CHLA_0001,2021-07-22,23:17:00,vancomycin,,IV,350,mg,3,6
CHLA_0001,2021-08-09,19:34:00,cefazolin,,IV,738,mg,1,8
CHLA_0002,2020-12-18,0:30:00,cefepime,,IV,75,mg,5,8
CHLA_0002,2020-12-18,1:03:00,vancomycin,,IV,23,mg,4,13
CHLA_0002,2020-12-19,18:15:00,cefepime,,IV,60,mg,6,8
CHLA_0002,2020-12-20,4:18:00,vancomycin,,IV,24,mg,4,12
CHLA_0003,2021-04-20,15:17:00,meropenem,,IV,200,mg,2,1
CHLA_0003,2021-04-21,2:20:00,meropenem,,IV,400,mg,17,8
CHLA_0003,2021-04-22,14:16:00,Other,sulfamethoxazole-trimethoprim,IV,50,mg,9,12
I tried the following without success:我尝试了以下但没有成功:
csfmelt <- melt(csf_antibio, id.vars=1:1)
csf <- dcast(csfmelt, study_id ~ variable, value.var = "value", fun.aggregate = sum)
I want the final dataframe to have each study id per row with variables我希望最终的 dataframe 每行都有每个研究 ID 和变量
study_id,proanbx_dt1,proanbx_tm1,name1,othername1,route1,dosage1,units1,doses1,freq1,proanbx_dt2,proanbx_tm2,name2,othername2,route2,dosage2,units2,doses2,freq2,proanbx_dt3,proanbx_tm3,name3,othername3,route3,dosage3,units3,doses3,freq3,proanbx_dt4,proanbx_tm4,name4,othername4,route4,dosage4,units4,doses4,freq4
CHLA_0001,2021-07-22,20:01:00,ceftriaxone,,IV,1250,mg,4,13, 2021-07-22,20:19:00,metronidazole,,IV,250,mg,5,9, 2021-07-22,23:17:00,vancomycin,,IV,350,mg,3,6,2021-08-09,19:34:00,cefazolin,,IV,738,mg,1,8
CHLA_0002,2020-12-18,0:30:00,cefepime,,IV,75,mg,5,8,2020-12-18,1:03:00,vancomycin,,IV,23,mg,4,13,2020-12-19,18:15:00,cefepime,,IV,60,mg,6,8,2020-12-20,4:18:00,vancomycin,,IV,24,mg,4,12,2021-04-20,15:17:00,meropenem,,IV,200,mg,2,1,2021-04-21,2:20:00,meropenem,,IV,400,mg,17,8,2021-04-22,14:16:00,Other,sulfamethoxazole-trimethoprim,IV,50,mg,9,12
Thanks in advance!提前致谢!
Your desired output has a "number" component that is not naturally inferred by dcast
.您想要的 output 具有dcast
无法自然推断的“数字”组件。 We can add it relatively easily with ave
(base R, certainly this can be done just as easily in data.table
or dplyr
groupings).我们可以使用ave
相对轻松地添加它(基数 R,当然这可以在data.table
或dplyr
分组中轻松完成)。
csfmelt$num <- ave(seq(nrow(csfmelt)), csfmelt[c("study_id","variable")], FUN = seq_along)
head(csfmelt)
# study_id variable value num
# 1 CHLA_0001 proanbx_dt 2021-07-22 1
# 2 CHLA_0001 proanbx_dt 2021-07-22 2
# 3 CHLA_0001 proanbx_dt 2021-07-22 3
# 4 CHLA_0001 proanbx_dt 2021-08-09 4
# 5 CHLA_0002 proanbx_dt 2020-12-18 1
# 6 CHLA_0002 proanbx_dt 2020-12-18 2
csfwide <- reshape2::dcast(csfmelt, study_id ~ variable + num, value.var = "value")
csfwide
# study_id proanbx_dt_1 proanbx_dt_2 proanbx_dt_3 proanbx_dt_4 proanbx_tm_1 proanbx_tm_2 proanbx_tm_3 proanbx_tm_4 name_1 name_2 name_3 name_4 othername_1 othername_2 othername_3 othername_4 route_1 route_2 route_3 route_4 dosage_1 dosage_2 dosage_3 dosage_4 units_1 units_2 units_3 units_4 doses_1 doses_2 doses_3 doses_4 freq_1 freq_2 freq_3 freq_4
# 1 CHLA_0001 2021-07-22 2021-07-22 2021-07-22 2021-08-09 20:01:00 20:19:00 23:17:00 19:34:00 ceftriaxone metronidazole vancomycin cefazolin IV IV IV IV 1250 250 350 738 mg mg mg mg 4 5 3 1 13 9 6 8
# 2 CHLA_0002 2020-12-18 2020-12-18 2020-12-19 2020-12-20 0:30:00 1:03:00 18:15:00 4:18:00 cefepime vancomycin cefepime vancomycin IV IV IV IV 75 23 60 24 mg mg mg mg 5 4 6 4 8 13 8 12
# 3 CHLA_0003 2021-04-20 2021-04-21 2021-04-22 <NA> 15:17:00 2:20:00 14:16:00 <NA> meropenem meropenem Other <NA> sulfamethoxazole-trimethoprim <NA> IV IV IV <NA> 200 400 50 <NA> mg mg mg <NA> 2 17 9 <NA> 1 8 12 <NA>
The column order is not what you requested, but it can be conformed a bit with this:列顺序不是您所要求的,但它可以与此保持一致:
variables <- as.character(unique(csfmelt$variable))
sub(".*_", "", names(csfwide)[-(1:2)])
# [1] "2" "3" "4" "1" "2" "3" "4" "1" "2" "3" "4" "1" "2" "3" "4" "1" "2" "3" "4" "1" "2" "3" "4" "1" "2" "3" "4" "1" "2" "3" "4" "1" "2" "3" "4"
sub("_[^_]*$", "", names(csfwide)[-(1:2)])
# [1] "proanbx_dt" "proanbx_dt" "proanbx_dt" "proanbx_tm" "proanbx_tm" "proanbx_tm" "proanbx_tm" "name" "name" "name" "name" "othername" "othername" "othername" "othername" "route" "route" "route" "route" "dosage"
# [21] "dosage" "dosage" "dosage" "units" "units" "units" "units" "doses" "doses" "doses" "doses" "freq" "freq" "freq" "freq"
nms <- names(csfwide)[-(1:2)]
newnms <- nms[order(sub(".*_", "", nms), match(nms, variables))]
csfwide2 <- subset(csfwide, select = c(names(csfwide)[1:2], newnms))
csfwide2
# study_id proanbx_dt_1 proanbx_tm_1 name_1 othername_1 route_1 dosage_1 units_1 doses_1 freq_1 proanbx_dt_2 proanbx_tm_2 name_2 othername_2 route_2 dosage_2 units_2 doses_2 freq_2 proanbx_dt_3 proanbx_tm_3 name_3 othername_3 route_3 dosage_3 units_3 doses_3 freq_3 proanbx_dt_4 proanbx_tm_4 name_4 othername_4 route_4 dosage_4 units_4 doses_4 freq_4
# 1 CHLA_0001 2021-07-22 20:01:00 ceftriaxone IV 1250 mg 4 13 2021-07-22 20:19:00 metronidazole IV 250 mg 5 9 2021-07-22 23:17:00 vancomycin IV 350 mg 3 6 2021-08-09 19:34:00 cefazolin IV 738 mg 1 8
# 2 CHLA_0002 2020-12-18 0:30:00 cefepime IV 75 mg 5 8 2020-12-18 1:03:00 vancomycin IV 23 mg 4 13 2020-12-19 18:15:00 cefepime IV 60 mg 6 8 2020-12-20 4:18:00 vancomycin IV 24 mg 4 12
# 3 CHLA_0003 2021-04-20 15:17:00 meropenem IV 200 mg 2 1 2021-04-21 2:20:00 meropenem IV 400 mg 17 8 2021-04-22 14:16:00 Other sulfamethoxazole-trimethoprim IV 50 mg 9 12 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
@r2evans gave you a great answer, but I was thinking about your comments regarding dates and time. @r2evans 给了你一个很好的答案,但我在考虑你对日期和时间的评论。 You didn't provide how you collected this data, so I can't tell you how to import it this way.你没有提供你是如何收集这些数据的,所以我不能告诉你如何以这种方式导入它。 However, I did convert these variables in the following code.但是,我确实在以下代码中转换了这些变量。 That being said, adding dates isn't meaningful.话虽这么说,添加日期是没有意义的。 I was thinking that the number of days and the amount of time passed might be more along the lines of what you were looking for for those particular variables.我当时在想,经过的天数和时间量可能更符合您为那些特定变量寻找的内容。 Unfortunately, I wasn't able to figure out how to make it work with reshape2
.不幸的是,我无法弄清楚如何让它与reshape2
一起使用。 This uses dplyr
, tidyselect
and hms
.这使用dplyr
、 tidyselect
和hms
。 Although, you would only have to call dplyr
, because I've appended the packages for the applicable functions.虽然,您只需调用dplyr
,因为我已经附加了适用功能的包。 (You need the packages installed, though.) (不过,您需要安装软件包。)
I didn't keep the name
and othername
because it's not multiple entries.我没有保留name
和othername
名称,因为它不是多个条目。
library(dplyr)
csf_antibio = read.table(header = T, sep = ",", text = "study_id,proanbx_dt,proanbx_tm,name,othername,route,dosage,units,doses,freq
CHLA_0001,2021-07-22,20:01:00,ceftriaxone,,IV,1250,mg,4,13
CHLA_0001,2021-07-22,20:19:00,metronidazole,,IV,250,mg,5,9
CHLA_0001,2021-07-22,23:17:00,vancomycin,,IV,350,mg,3,6
CHLA_0001,2021-08-09,19:34:00,cefazolin,,IV,738,mg,1,8
CHLA_0002,2020-12-18,0:30:00,cefepime,,IV,75,mg,5,8
CHLA_0002,2020-12-18,1:03:00,vancomycin,,IV,23,mg,4,13
CHLA_0002,2020-12-19,18:15:00,cefepime,,IV,60,mg,6,8
CHLA_0002,2020-12-20,4:18:00,vancomycin,,IV,24,mg,4,12
CHLA_0003,2021-04-20,15:17:00,meropenem,,IV,200,mg,2,1
CHLA_0003,2021-04-21,2:20:00,meropenem,,IV,400,mg,17,8
CHLA_0003,2021-04-22,14:16:00,Other,sulfamethoxazole-trimethoprim,IV,50,mg,9,12")
Because the time is truly linked to the date, I wrote a function to process the time difference.因为时间是真正和日期挂钩的,所以我写了一个function来处理时差。
timer <- function(df1){
maxtm = max(df1$proanbx_tm[df1$proanbx_dt == max(df1$proanbx_dt)]) %>% hms::as_hms()
mintm = min(df1$proanbx_tm[df1$proanbx_dt == min(df1$proanbx_dt)]) %>% hms::as_hms()
if(maxtm > mintm){
tmr = (maxtm - mintm) %>% hms::as_hms() # captures mult entries in the same day
} else if(mintm > maxtm) {
tmr = (maxtm - mintm) + hms::as_hms('24:00:00') # add a full day
} else { # only one entry or the time is identical in max/min
tmr = hms::as_hms('0')
}
return(tmr)
}
I collected the column names to return the columns to the original order.我收集了列名以将列恢复为原始顺序。
ordNames = names(csf_antibio) # collect names to return order to columns
# [1] "study_id" "proanbx_dt" "proanbx_tm" "name" "othername" "route"
# [7] "dosage" "units" "doses" "freq"
# names kept = ordNames[,c(1:3,6:10)]
Find the sums and differences in time找出时间的和与差
csf2 <- csf_antibio %>%
mutate(proanbx_dt = as.Date(proanbx_dt), # convert to date
proanbx_tm = hms::as_hms(proanbx_tm)) %>% # convert to time
group_by(study_id) %>% # group by study
summarise(proanbx_tm = timer(.data), # difference in time
proanbx_dt = max(proanbx_dt) - min(proanbx_dt), # difference in days
across(tidyselect:::where(is.integer), sum),
units = "mg",
route = "IV") %>%
select(ordNames[c(1:3,6:10)])
head(csf2)
# # A tibble: 3 × 8
# study_id proanbx_dt proanbx_tm route dosage units doses freq
# <chr> <drtn> <time> <chr> <int> <chr> <int> <int>
# 1 CHLA_0001 18 days 23:33 IV 2588 mg 13 36
# 2 CHLA_0002 2 days 03:48 IV 182 mg 19 41
# 3 CHLA_0003 2 days 22:59 IV 650 mg 28 21
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.