Collapse only some variables long to wide format in R
I am relatively new to R, and every time I need to "reshape" data, I am absolutely baffled. I have data that looks like this:
HAVE:
ID ever_smoked alcoholic medication dosage
1 1 no no humira/adalimumab 40mg
2 1 no no prednisone 15mg
3 1 no no azathioprine 30mg
4 1 no no rowasa 9mg
5 2 yes no lialda 20mg
6 2 yes no mercaptopurine 1g
7 2 yes no asacol 1600mg
WANT:
ID ever_smoked alcoholic medication
1 1 no no humira/adalimumab, prednisone, azathioprine, rowasa
2 2 yes no lialda, mercaptopurine, asacol
dosage most_recent_med most_recent_dose
1 40mg, 15mg, 30mg, 9mg rowasa 9mg
2 20mg, 1g, 1600mg asacol 1600mg
(Please note that the result should have 2 observations and 7 variables.)
In essence, I want to (1) collapse only a few of the variables, (2) retain the first row of the other variables, and (3) create 2 new variables based on the last entries of some of the variables.
Code to reproduce:
have <- data.frame(ID = c(1, 1, 1, 1, 2, 2, 2),
ever_smoked = c("no", "no", "no", "no", "yes", "yes", "yes"),
alcoholic = c("no", "no", "no", "no", "no", "no", "no"),
medication = c("humira/adalimumab", "prednisone", "azathioprine", "rowasa", "lialda", "mercaptopurine", "asacol"),
dosage = c("40mg", "15mg", "30mg", "9mg", "20mg", "1g", "1600mg"), stringsAsFactors = FALSE)
want <- data.frame(ID = c(1, 2),
ever_smoked = c("no", "yes"),
alcoholic = c("no", "no"),
medication = c("humira/adalimumab, prednisone, azathioprine, rowasa", "lialda, mercaptopurine, asacol"),
dosage = c("40mg, 15mg, 30mg, 9mg", "20mg, 1g, 1600mg"),
most_recent_med = c("rowasa", "asacol"),
most_recent_dose = c("9mg", "1600mg"), stringsAsFactors = FALSE)
Thanks.
Here are some different approaches:
1) sqldf
library(sqldf)
sqldf("select ID,
ever_smoked,
alcoholic,
group_concat(medication) as medication,
group_concat(dosage) as dosage,
medication as last_medication,
dosage as last_dosage
from have
group by ID")
giving:
ID ever_smoked alcoholic medication dosage last_medication last_dosage
1 1 no no humira/adalimumab,prednisone,azathioprine,rowasa 40mg,15mg,30mg,9mg rowasa 9mg
2 2 yes no lialda,mercaptopurine,asacol 20mg,1g,1600mg asacol 1600mg
2) data.table
library(data.table)
have_dt <- data.table(have)
have_dt[, list(medication = toString(medication),
dosage = toString(dosage),
last_medication = medication[.N],
last_dosage = dosage[.N]),
by = "ID,ever_smoked,alcoholic"]
giving:
ID ever_smoked alcoholic medication dosage last_medication last_dosage
1: 1 no no humira/adalimumab, prednisone, azathioprine, rowasa 40mg, 15mg, 30mg, 9mg rowasa 9mg
2: 2 yes no lialda, mercaptopurine, asacol 20mg, 1g, 1600mg asacol 1600mg
3) base - by
do.call("rbind", by(have, have$ID, with, data.frame(
ID = ID[1],
ever_smoked = ever_smoked[1],
alcoholic = alcoholic[1],
medication = toString(medication),
dosage = toString(dosage),
last_medication = tail(medication, 1),
last_dosage = tail(dosage, 1))))
giving:
ID ever_smoked alcoholic medication dosage last_medication last_dosage
1 1 no no humira/adalimumab, prednisone, azathioprine, rowasa 40mg, 15mg, 30mg, 9mg rowasa 9mg
2 2 yes no lialda, mercaptopurine, asacol 20mg, 1g, 1600mg asacol 1600mg
Note that this could alternatively be written as:
do.call("rbind", by(have, have$ID, function(x) with(x, data.frame(
ID = ID[1],
ever_smoked = ever_smoked[1],
alcoholic = alcoholic[1],
medication = toString(medication),
dosage = toString(dosage),
last_medication = tail(medication, 1),
last_dosage = tail(dosage, 1)))))
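As a rough check of the base `by` approach against the question's `want` data frame, the sketch below uses the column names the question asked for (`most_recent_med`, `most_recent_dose`) instead of `last_medication` and `last_dosage`. The name `res` and the explicit `stringsAsFactors = FALSE` are assumptions added for illustration; they are not part of the original answer.

```r
# Input and target data, copied from the question.
have <- data.frame(
  ID = c(1, 1, 1, 1, 2, 2, 2),
  ever_smoked = c("no", "no", "no", "no", "yes", "yes", "yes"),
  alcoholic = c("no", "no", "no", "no", "no", "no", "no"),
  medication = c("humira/adalimumab", "prednisone", "azathioprine", "rowasa",
                 "lialda", "mercaptopurine", "asacol"),
  dosage = c("40mg", "15mg", "30mg", "9mg", "20mg", "1g", "1600mg"),
  stringsAsFactors = FALSE
)
want <- data.frame(
  ID = c(1, 2),
  ever_smoked = c("no", "yes"),
  alcoholic = c("no", "no"),
  medication = c("humira/adalimumab, prednisone, azathioprine, rowasa",
                 "lialda, mercaptopurine, asacol"),
  dosage = c("40mg, 15mg, 30mg, 9mg", "20mg, 1g, 1600mg"),
  most_recent_med = c("rowasa", "asacol"),
  most_recent_dose = c("9mg", "1600mg"),
  stringsAsFactors = FALSE
)

# One summary row per ID: first value of the constant columns,
# comma-separated string of the collapsed columns, last value of each.
res <- do.call("rbind", by(have, have$ID, function(x) with(x, data.frame(
  ID = ID[1],
  ever_smoked = ever_smoked[1],
  alcoholic = alcoholic[1],
  medication = toString(medication),
  dosage = toString(dosage),
  most_recent_med = tail(medication, 1),
  most_recent_dose = tail(dosage, 1),
  stringsAsFactors = FALSE))))

# rbind() names the rows after the by() groups; reset them before comparing.
rownames(res) <- NULL
all.equal(res, want)
```

Resetting the row names matters only for the comparison; the summarised values themselves are unaffected.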
4) base - aggregate
aggregate(. ~ ID + ever_smoked + alcoholic, have,
function(x) c(values = toString(x), last = as.character(tail(x, 1))))
giving:
ID ever_smoked alcoholic medication.values medication.last dosage.values dosage.last
1 1 no no humira/adalimumab, prednisone, azathioprine, rowasa rowasa 40mg, 15mg, 30mg, 9mg 9mg
2 2 yes no lialda, mercaptopurine, asacol asacol 20mg, 1g, 1600mg 1600mg
Note that this returns a 2 x 5 data frame in which the last two columns are each 2-column matrices. That can be more convenient for indexing than the flattened form. If the flattened form is preferred, and DF holds the result of the aggregate call above, then use:
do.call("data.frame", DF)
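To make the difference between the nested and flattened forms concrete, here is a small sketch. `DF` and `flat` are names assumed for illustration: `DF` stores the result of the `aggregate` call and `flat` the flattened copy.

```r
# Input data, copied from the question.
have <- data.frame(
  ID = c(1, 1, 1, 1, 2, 2, 2),
  ever_smoked = c("no", "no", "no", "no", "yes", "yes", "yes"),
  alcoholic = c("no", "no", "no", "no", "no", "no", "no"),
  medication = c("humira/adalimumab", "prednisone", "azathioprine", "rowasa",
                 "lialda", "mercaptopurine", "asacol"),
  dosage = c("40mg", "15mg", "30mg", "9mg", "20mg", "1g", "1600mg"),
  stringsAsFactors = FALSE
)

# DF (assumed name) holds the aggregate() result; medication and dosage
# are each stored as a two-column matrix with columns "values" and "last".
DF <- aggregate(. ~ ID + ever_smoked + alcoholic, have,
                function(x) c(values = toString(x), last = as.character(tail(x, 1))))
dim(DF)
DF$medication[, "last"]   # index into the matrix column directly

# Flattening splits each matrix column into ordinary columns.
flat <- do.call("data.frame", DF)
dim(flat)
names(flat)
```

The matrix form keeps related summaries together under one column name, while the flattened form gives one ordinary column per summary, which is what most downstream code expects.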
This is a summary process: you can use summarise_all and pass two functions to summarize each column, one to collapse the column with toString and one to take the last row with last:
library(dplyr)
# Note: funs() has been deprecated since dplyr 0.8.0.
have %>%
group_by(ID, ever_smoked, alcoholic) %>%
summarise_all(funs(toString(.), most_recent = last(.)))
# A tibble: 2 x 7
# Groups: ID, ever_smoked [?]
# ID ever_smoked alcoholic medication_toString dosage_toString medication_most_recent dosage_most_recent
# <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
#1 1 no no humira/adalimumab, prednisone, azathioprine, rowasa 40mg, 15mg, 30mg, 9mg rowasa 9mg
#2 2 yes no lialda, mercaptopurine, asacol 20mg, 1g, 1600mg asacol 1600mg
This assumes that ever_smoked and alcoholic are constant within each ID.
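Because `funs()` is deprecated and `summarise_all()` has been superseded in current dplyr, the same idea might be written with `across()`. This is only a sketch, assuming dplyr 1.0.0 or later; the names `n_combos` and `result` are introduced here for illustration, and the base R check at the top is an added way of testing the assumption about `ever_smoked` and `alcoholic`, not part of the original answer.

```r
library(dplyr)

# Input data, copied from the question.
have <- data.frame(
  ID = c(1, 1, 1, 1, 2, 2, 2),
  ever_smoked = c("no", "no", "no", "no", "yes", "yes", "yes"),
  alcoholic = c("no", "no", "no", "no", "no", "no", "no"),
  medication = c("humira/adalimumab", "prednisone", "azathioprine", "rowasa",
                 "lialda", "mercaptopurine", "asacol"),
  dosage = c("40mg", "15mg", "30mg", "9mg", "20mg", "1g", "1600mg"),
  stringsAsFactors = FALSE
)

# Check the assumption: each ID has exactly one distinct
# (ever_smoked, alcoholic) combination.
n_combos <- tapply(paste(have$ever_smoked, have$alcoholic), have$ID,
                   function(x) length(unique(x)))
stopifnot(all(n_combos == 1))

# Apply both summaries to every non-grouping column; across() names the
# new columns <column>_<function name>.
result <- have %>%
  group_by(ID, ever_smoked, alcoholic) %>%
  summarise(across(everything(),
                   list(toString = toString, most_recent = last)),
            .groups = "drop")
result
```

If the check fails, grouping by all three columns would produce more than one row per ID, so the constant columns would need to be summarised explicitly (for example with `first()`) instead of being used as grouping variables.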