R: long to wide format with factor levels as binary variables and dates
I want to reshape a data frame from long to wide format and use the factor levels as binary variables. That is, if a factor level occurs at least once for a PatID, the corresponding variable should be 1, otherwise 0. In addition, I want the dates as variable values date.1, date.2,...
What I have is the following:
data_sample <- data.frame(
PatID = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L),
date = c("2016-12-14", "2017-02-04", "NA", "NA", "2012-27-03", "2012-04-21", "2010-02-03", "2011-03-05", "2014-08-25"),
status = c("COPD", "CPOD", "NA", "NA", "Cardio", "CPOD", "Cardio", "Cardio", "Cerebro")
)
What I want is:
PatID COPD Cardio Cerebro date.COPD.1 date.COPD.2 date.Cardio.1 date.Cardio.2 date.Cerebro.1
1 1 0 0 2016-12-14 2017-02-04 NA NA NA
2 0 1 0 NA NA 2012-03-27 NA NA
3 1 1 1 2012-04-21 NA 2010-02-03 2011-03-05 2014-08-25
There are a few steps to take, but this should give you your desired output.
Note, however, that there seems to be a typo in the input data: I assume you meant "COPD" instead of "CPOD", because that is what your expected output suggests.
The first step is to turn the string "NA" into an explicit missing value, i.e. NA.
data_sample[data_sample == "NA"] <- NA
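As a quick check of what this assignment does, here is a minimal sketch on a made-up data frame. The name `toy` and its values are illustrative only, and `stringsAsFactors = FALSE` is assumed so that the columns are plain character vectors.

```r
# Hypothetical toy data, not the data from the question
toy <- data.frame(
  date   = c("2016-12-14", "NA"),
  status = c("COPD", "NA"),
  stringsAsFactors = FALSE
)

# toy == "NA" is a logical matrix; every TRUE cell is overwritten with a real NA
toy[toy == "NA"] <- NA

is.na(toy$date)    # FALSE TRUE
is.na(toy$status)  # FALSE TRUE
```

After this step, `is.na()` recognises the former "NA" strings, which is what the later filter `!is.na(date)` relies on.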
Now use data.table::dcast for the reshaping.
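library(data.table)
setDT(data_sample)
# create an id column that numbers repeated statuses within each patient
data_sample[, id := rowid(status), by = PatID]
# one 0/1 indicator column per status (1 if the status occurs at least once)
dt1 <- dcast(data_sample[!is.na(date)], PatID ~ status, value.var = "id", fun.aggregate = function(x) +any(x))
# one date column per status and occurrence
dt2 <- dcast(data_sample[!is.na(date)], PatID ~ paste0("date_", status) + id, value.var = "date")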
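To see what the id column contains, here is a small sketch on a made-up table (the name `toy` and its rows are illustrative, not taken from the question). It shows how `rowid()` numbers repeated status values within one PatID, which is what gives the date columns their `_1`, `_2` suffixes.

```r
library(data.table)

# Hypothetical toy data: one patient with two "Cardio" rows and one "COPD" row
toy <- data.table(
  PatID  = c(3L, 3L, 3L),
  status = c("Cardio", "COPD", "Cardio")
)

# rowid() counts occurrences of each status value in order of appearance,
# restarted for every PatID because of by = PatID
toy[, id := rowid(status), by = PatID]

toy$id  # 1 1 2
```

The second "Cardio" row gets id 2, so its date would end up in a separate date_Cardio_2 column rather than colliding with the first one.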
Finally, join both data.tables:
out <- dt1[dt2, on = 'PatID']
out
# PatID Cardio Cerebro COPD date_COPD_1 date_COPD_2 date_Cardio_1 date_Cardio_2 date_Cerebro_1
#1: 1 0 0 1 2016-12-14 2017-02-04 <NA> <NA> <NA>
#2: 2 1 0 0 <NA> <NA> 2012-27-03 <NA> <NA>
#3: 3 1 1 1 2012-04-21 <NA> 2010-02-03 2011-03-05 2014-08-25
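The join syntax used above can be sketched in isolation as follows. The tables `flags` and `dates` are hypothetical stand-ins for dt1 and dt2 with made-up values.

```r
library(data.table)

# Hypothetical toy tables that share the key column PatID
flags <- data.table(PatID = 1:2, COPD = c(1L, 0L))
dates <- data.table(PatID = 1:2, date_COPD_1 = c("2016-12-14", NA))

# X[Y, on = ...] looks up every row of Y in X and puts the matched columns side by side
joined <- flags[dates, on = "PatID"]

names(joined)  # "PatID" "COPD" "date_COPD_1"
```

Because both tables are built from the same filtered rows, every PatID in one table should also appear in the other, so the join simply places the indicator columns next to the date columns.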
Data
data_sample <- data.frame(
PatID = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L),
date = c("2016-12-14", "2017-02-04", "NA", "NA", "2012-27-03", "2012-04-21", "2010-02-03", "2011-03-05", "2014-08-25"),
status = c("COPD", "COPD", "NA", "NA", "Cardio", "COPD", "Cardio", "Cardio", "Cerebro"))