R: long to wide format with factor levels as binary variables and dates

Question

I want to reshape data from long to wide format and use the factor levels as binary variables. That is, if a factor level occurs at least once for a patient, the corresponding variable should be 1, otherwise 0. In addition, I want the dates as variable values date.1, date.2, ...

What I have is the following:

data_sample <- data.frame(
  PatID  = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L),
  date   = c("2016-12-14", "2017-02-04", "NA", "NA", "2012-27-03", "2012-04-21", "2010-02-03", "2011-03-05", "2014-08-25"),
  status = c("COPD", "CPOD", "NA", "NA", "Cardio", "CPOD", "Cardio", "Cardio", "Cerebro")
)

What I want is:

PatID  COPD Cardio Cerebro date.COPD.1 date.COPD.2 date.Cardio.1  date.Cardio.2  date.Cerebro.1
1        1    0       0    2016-12-14  2017-02-04     NA               NA          NA
2        0    1       0      NA           NA        2012-03-27         NA          NA 
3        1    1       1    2012-04-21     NA        2010-02-03    2011-03-05      2014-08-25

Answer 1

There are a few steps to take, but this should give you your desired output.

Note, however, that there seems to be a typo in the input data: I assume you meant "COPD" instead of "CPOD", because that is what your expected output shows.

The first step is to turn the string "NA" into an explicit missing value, i.e. NA.

data_sample[data_sample == "NA"] <- NA
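
To see what this replacement does, here is a minimal sketch on a toy data frame (the name `toy` and its values are made up for illustration). Comparing a data frame with "NA" yields a logical matrix, and that matrix selects the cells to overwrite. The sketch assumes the columns are character vectors rather than factors.

```r
# Hypothetical toy data, not part of the question
toy <- data.frame(
  id     = c(1L, 2L, 3L),
  status = c("COPD", "NA", "Cardio"),
  stringsAsFactors = FALSE
)

# toy == "NA" is a logical matrix; cells that are TRUE get a real NA
toy[toy == "NA"] <- NA

# Only the second status is now missing
is.na(toy$status)
```

Columns without any "NA" string, such as `id` here, are left untouched.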

Now use data.table::dcast for the reshaping.

library(data.table)  
setDT(data_sample)

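# create id column: running count of each status within a patient
data_sample[, id := rowid(status), by = PatID]
# one 0/1 indicator column per status (rows with a missing date are dropped)
dt1 <- dcast(data_sample[!is.na(date)], PatID ~ status, fun.aggregate = function(x) +any(x))
# one date column per status and occurrence, e.g. date_COPD_1, date_COPD_2
dt2 <- dcast(data_sample[!is.na(date)], PatID ~ paste0("date_", status) + id, value.var = "date")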
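
The `id` column is what separates repeated diagnoses of the same patient into date_COPD_1, date_COPD_2, and so on. A small sketch of how `rowid()` numbers them, using a made-up table `toy` (it requires the data.table package):

```r
library(data.table)

# Hypothetical toy data: patient 1 has COPD twice
toy <- data.table(
  PatID  = c(1L, 1L, 1L, 2L),
  status = c("COPD", "COPD", "Cardio", "COPD")
)

# rowid() counts repeated values; by = PatID restarts the count per patient
toy[, id := rowid(status), by = PatID]

toy$id
# patient 1: COPD -> 1, COPD -> 2, Cardio -> 1; patient 2: COPD -> 1
```

Without this counter, two COPD rows for the same patient would fall into the same cell of the wide table.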

Finally, join both data.tables:

out <- dt1[dt2, on = 'PatID']
out
#  PatID Cardio Cerebro COPD date_COPD_1 date_COPD_2 date_Cardio_1 date_Cardio_2 date_Cerebro_1
#1:     1      0       0    1  2016-12-14  2017-02-04          <NA>          <NA>           <NA>
#2:     2      1       0    0        <NA>        <NA>    2012-27-03          <NA>           <NA>
#3:     3      1       1    1  2012-04-21        <NA>    2010-02-03    2011-03-05     2014-08-25
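
Note that this output keeps the date "2012-27-03" exactly as it was typed, whereas the desired output in the question shows 2012-03-27. If that value is meant as year-day-month (an assumption, the question does not say so), a sketch like the following could normalize it before reshaping. The helper name `fix_date` is hypothetical.

```r
# Hypothetical helper: parse as year-month-day first,
# then fall back to year-day-month where that fails
fix_date <- function(x) {
  ymd <- as.Date(x, format = "%Y-%m-%d")
  ydm <- as.Date(x, format = "%Y-%d-%m")
  failed <- is.na(ymd)
  ymd[failed] <- ydm[failed]
  ymd
}

fixed <- fix_date(c("2016-12-14", "2012-27-03", NA))
format(fixed)
```

Because year-month-day is tried first, a value that is valid in both orders is always read as year-month-day; only values that fail that parse are reinterpreted.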

Data

data_sample <- data.frame(
  PatID   = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L),
  date = c("2016-12-14", "2017-02-04", "NA", "NA", "2012-27-03", "2012-04-21", "2010-02-03", "2011-03-05", "2014-08-25"),
  status =c("COPD", "COPD", "NA", "NA", "Cardio", "COPD", "Cardio", "Cardio", "Cerebro"))

Answer 1 (score 0, accepted), 2018-11-30 10:17:37
