Collapsing only some variables from long to wide format in R

Question

I am relatively new to R, and every time I need to "reshape" data, I am absolutely baffled. I have data that looks like this:

HAVE:

  ID ever_smoked alcoholic        medication dosage
1  1          no        no humira/adalimumab   40mg
2  1          no        no        prednisone   15mg
3  1          no        no      azathioprine   30mg
4  1          no        no            rowasa    9mg
5  2         yes        no            lialda   20mg
6  2         yes        no    mercaptopurine     1g
7  2         yes        no            asacol 1600mg

WANT:

 ID  ever_smoked  alcoholic  medication
1  1          no        no   humira/adalimumab, prednisone, azathioprine, rowasa
2  2         yes        no   lialda, mercaptopurine, asacol

  dosage                  most_recent_med     most_recent_dose
1 40mg, 15mg, 30mg, 9mg   rowasa              9mg
2 20mg, 1g, 1600mg        asacol              1600mg

(Please note that it should be 2 observations and 7 variables.)

In essence, I want to (1) collapse only a few of the variables, (2) retain the first row of the other variables, and (3) create 2 new variables based on the last entries of some of the variables.

Code to reproduce:

have <- data.frame(ID = c(1, 1, 1, 1, 2, 2, 2),
    ever_smoked = c("no", "no", "no", "no", "yes", "yes", "yes"), 
    alcoholic = c("no", "no", "no", "no", "no", "no", "no"),
    medication = c("humira/adalimumab", "prednisone", "azathioprine", "rowasa", "lialda", "mercaptopurine", "asacol"),
    dosage = c("40mg", "15mg", "30mg", "9mg", "20mg", "1g", "1600mg"), stringsAsFactors = FALSE)

want <- data.frame(ID = c(1, 2),
    ever_smoked = c("no", "yes"), 
    alcoholic = c("no", "no"),
    medication = c("humira/adalimumab, prednisone, azathioprine, rowasa", "lialda, mercaptopurine, asacol"),
    dosage = c("40mg, 15mg, 30mg, 9mg", "20mg, 1g, 1600mg"),
    most_recent_med = c("rowasa", "asacol"),
    most_recent_dose = c("9mg", "1600mg"), stringsAsFactors = FALSE)

Thanks.

Answer 1

Here are some different approaches:

1) sqldf

library(sqldf)
# The bare medication and dosage columns rely on SQLite taking their values
# from one row of each group (here the last one). SQLite documents that row
# as arbitrary, so this is not guaranteed by the SQL standard.
sqldf("select ID, 
              ever_smoked, 
              alcoholic, 
              group_concat(medication) as medication,
              group_concat(dosage) as dosage, 
              medication as last_medication, 
              dosage as last_dosage
        from have
        group by ID")

giving:

  ID ever_smoked alcoholic                                       medication             dosage last_medication last_dosage
1  1          no        no humira/adalimumab,prednisone,azathioprine,rowasa 40mg,15mg,30mg,9mg          rowasa         9mg
2  2         yes        no                     lialda,mercaptopurine,asacol     20mg,1g,1600mg          asacol      1600mg

2) data.table

library(data.table)
have_dt <- data.table(have)
have_dt[, list(medication = toString(medication),
               dosage = toString(dosage),
               last_medication = medication[.N],
               last_dosage = dosage[.N]),
           by = "ID,ever_smoked,alcoholic"]

giving:

   ID ever_smoked alcoholic                                          medication                dosage last_medication last_dosage
1:  1          no        no humira/adalimumab, prednisone, azathioprine, rowasa 40mg, 15mg, 30mg, 9mg          rowasa         9mg
2:  2         yes        no                      lialda, mercaptopurine, asacol      20mg, 1g, 1600mg          asacol      1600mg

3) base - by

do.call("rbind", by(have, have$ID, with, data.frame(
     ID = ID[1], 
     ever_smoked = ever_smoked[1], 
     alcoholic = alcoholic[1],
     medication = toString(medication),
     dosage = toString(dosage),
     last_medication = tail(medication, 1),
     last_dosage = tail(dosage, 1))))

giving:

  ID ever_smoked alcoholic                                          medication                dosage last_medication last_dosage
1  1          no        no humira/adalimumab, prednisone, azathioprine, rowasa 40mg, 15mg, 30mg, 9mg          rowasa         9mg
2  2         yes        no                      lialda, mercaptopurine, asacol      20mg, 1g, 1600mg          asacol      1600mg

Note that this could alternatively be written as:

do.call("rbind", by(have, have$ID, function(x) with(x, data.frame(
     ID = ID[1], 
     ever_smoked = ever_smoked[1], 
     alcoholic = alcoholic[1],
     medication = toString(medication),
     dosage = toString(dosage),
     last_medication = tail(medication, 1),
     last_dosage = tail(dosage, 1)))))

4) base - aggregate

aggregate(. ~ ID + ever_smoked + alcoholic, have,
  function(x) c(values = toString(x), last = as.character(tail(x, 1))))

giving:

  ID ever_smoked alcoholic                                   medication.values medication.last         dosage.values dosage.last
1  1          no        no humira/adalimumab, prednisone, azathioprine, rowasa          rowasa 40mg, 15mg, 30mg, 9mg         9mg
2  2         yes        no                      lialda, mercaptopurine, asacol          asacol      20mg, 1g, 1600mg      1600mg

Note that this returns a 2 x 5 data frame in which the last two columns are each 2-column matrices. That can be more convenient for indexing than the flattened form, but if the flattened form is preferred, use do.call("data.frame", DF), where DF is the result of the aggregate call above.
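A minimal sketch of that flattening step, assuming `have` is the data frame from the question. The names `DF` and `flat` are chosen here for illustration only:

```r
# Data from the question.
have <- data.frame(ID = c(1, 1, 1, 1, 2, 2, 2),
    ever_smoked = c("no", "no", "no", "no", "yes", "yes", "yes"),
    alcoholic = c("no", "no", "no", "no", "no", "no", "no"),
    medication = c("humira/adalimumab", "prednisone", "azathioprine", "rowasa",
                   "lialda", "mercaptopurine", "asacol"),
    dosage = c("40mg", "15mg", "30mg", "9mg", "20mg", "1g", "1600mg"),
    stringsAsFactors = FALSE)

# DF is an illustrative name for the result of the aggregate() call.
DF <- aggregate(. ~ ID + ever_smoked + alcoholic, have,
  function(x) c(values = toString(x), last = as.character(tail(x, 1))))

# medication and dosage are each stored as a 2-column matrix.
is.matrix(DF$medication)
DF$medication[, "last"]   # index the matrix column by name

# Flatten the matrix columns into ordinary data frame columns.
flat <- do.call("data.frame", DF)
names(flat)
```

Keeping the matrix form lets you pick out the "values" or "last" part of a variable by name, while the flattened form is easier to write out or join with other data.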

Answer 2

This is a summary process: you can use summarise_all and pass two functions to summarize each column, one to collapse the column with toString and one to take the last row with last:

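library(dplyr)

# funs() is deprecated as of dplyr 0.8.0; in newer versions pass
# list(toString = toString, most_recent = last) instead.
have %>% 
    group_by(ID, ever_smoked, alcoholic) %>% 
    summarise_all(funs(toString(.), most_recent = last(.)))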

# A tibble: 2 x 7
# Groups:   ID, ever_smoked [?]
#     ID ever_smoked alcoholic                                 medication_toString       dosage_toString medication_most_recent dosage_most_recent
#  <dbl>       <chr>     <chr>                                               <chr>                 <chr>                  <chr>              <chr>
#1     1          no        no humira/adalimumab, prednisone, azathioprine, rowasa 40mg, 15mg, 30mg, 9mg                 rowasa                9mg
#2     2         yes        no                      lialda, mercaptopurine, asacol      20mg, 1g, 1600mg                 asacol             1600mg

Assume ever_smoked and alcoholic are unique for each ID here.
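Since this answer was written, funs() has been deprecated and the scoped verbs such as summarise_all() have been superseded by across(). The following is a sketch of an equivalent call, assuming dplyr 1.0.0 or later. The name `res` is illustrative, and note that across() orders the output columns by variable rather than by function:

```r
library(dplyr)

# Data from the question.
have <- data.frame(ID = c(1, 1, 1, 1, 2, 2, 2),
    ever_smoked = c("no", "no", "no", "no", "yes", "yes", "yes"),
    alcoholic = c("no", "no", "no", "no", "no", "no", "no"),
    medication = c("humira/adalimumab", "prednisone", "azathioprine", "rowasa",
                   "lialda", "mercaptopurine", "asacol"),
    dosage = c("40mg", "15mg", "30mg", "9mg", "20mg", "1g", "1600mg"),
    stringsAsFactors = FALSE)

# Apply both summary functions to medication and dosage within each group.
# The default naming scheme yields columns such as medication_toString
# and medication_most_recent.
res <- have %>%
    group_by(ID, ever_smoked, alcoholic) %>%
    summarise(across(c(medication, dosage),
                     list(toString = toString, most_recent = last)),
              .groups = "drop")

res
```

If the column names from the question are wanted (most_recent_med, most_recent_dose), a rename() step on `res` can be added afterwards.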
