简体   繁体   English

R数据帧组按列并创建另一列的重复次数

[英]R data frame group by a column and create repetition number of another column

I have a data frame like this: 我有一个这样的数据框:

ii <- data.frame(cid = c(rep('a',8),rep('b',5)),
                 Interaction = c(rep('VCS',3), c('SLS'), rep('TCU',2), rep('MFM',2), rep('SLS', 2), 'COMM', rep('MFM',2)),
                 stringsAsFactors = F
)

   cid Interaction
1    a         VCS
2    a         VCS
3    a         VCS
4    a         SLS
5    a         TCU
6    a         TCU
7    a         MFM
8    a         MFM
9    b         SLS
10   b         SLS
11   b        COMM
12   b         MFM
13   b         MFM

And I would like to first group by cid then create another column that shows repetition number of Interaction columns. 我想首先按cid分组,然后创建另一个列,显示Interaction列的重复次数。 The result should look like this: 结果应如下所示:

   cid Interaction   replicate
1    a         VCS   1
2    a         VCS   2
3    a         VCS   3
4    a         SLS   1
5    a         TCU   1
6    a         TCU   2
7    a         MFM   1
8    a         MFM   2
9    b         SLS   1
10   b         SLS   2
11   b        COMM   1
12   b         MFM   1
13   b         MFM   2

Eventually I wanted to also reshape this to a wide format (couldn't do it with the current format cause I lose duplicates) that would resemble something like: 最终我想要将其重新整形为宽格式(无法使用当前格式,因为我丢失了重复项),这类似于:

   cid         InteractionTuple   
1    a         VCS1;VCS2;VCS3;SLS1;TCU1;TCU2;MFM1;MFM2
2    b         SLS1;SLS2;COMM;MFM1;MFM2

to be able to run association rule mining techniques, that are currently supporting repeated items per transaction. 能够运行关联规则挖掘技术,这些技术目前支持每个事务的重复项目。

Using dplyr: 使用dplyr:

library(dplyr)

ii %>% 
  group_by(cid, Interaction) %>% 
  mutate(Interaction_rn = paste0(Interaction, row_number())) %>% 
  group_by(cid) %>% 
  summarise(InteractionTuple = paste(Interaction_rn, collapse = ";"))

# # A tibble: 2 x 2
#     cid                        InteractionTuple
#   <chr>                                   <chr>
# 1     a VCS1;VCS2;VCS3;SLS1;TCU1;TCU2;MFM1;MFM2
# 2     b               SLS1;SLS2;COMM1;MFM1;MFM2

Here's a data.table solution 这是一个data.table解决方案

library(data.table)
setDT(dt)
dt[ , "replicate" := 1:.N, by = .(Interaction, cid)]

    cid Interaction replicate
 1:   a         VCS         1
 2:   a         VCS         2
 3:   a         VCS         3
 4:   a         SLS         1
 5:   a         TCU         1
 6:   a         TCU         2
 7:   a         MFM         1
 8:   a         MFM         2
 9:   b         SLS         1
10:   b         SLS         2
11:   b        COMM         1
12:   b         MFM         1
13:   b         MFM         2

Edit part2: 编辑第2部分:

dt2 = dt[ , .("InteractionTuple" = paste(Interaction, replicate, sep = "", collapse = ";")), by = .(cid)]
> dt2
   cid                        InteractionTuple
1:   a VCS1;VCS2;VCS3;SLS1;TCU1;TCU2;MFM1;MFM2
2:   b               SLS1;SLS2;COMM1;MFM1;MFM2

Edit2 EDIT2

@MikeH suggested a different way which might be faster. @MikeH提出了一种可能更快的不同方式。 Here are the results 结果如下

microbenchmark(dt2 = dt[ , .("replicate" = 1:.N), by = .(Interaction, cid)],
               dt3 = dt[ , .("replicate" = seq_len(.N)), by = .(Interaction, cid)], times = 1000L)

Unit: microseconds
 expr     min      lq     mean   median       uq      max neval
  dt2 323.960 364.361 434.6370 402.8740 457.6220  2382.88  1000
  dt3 318.296 360.585 508.1313 397.3985 461.5865 42750.25  1000

The median is a little bit better using seq_len(.N) . 使用seq_len(.N) ,中位数稍微好一点。

This answer based on dplyr 这个答案基于dplyr
1st Part 第一部分

Q1=ii%>%group_by(cid,Interaction)%>%
       mutate(replicate=rank(Interaction,ties.method="first"))
Q1
     cid Interaction replicate
   <chr>       <chr>     <int>
1      a         VCS         1
2      a         VCS         2
3      a         VCS         3
4      a         SLS         1
5      a         TCU         1
6      a         TCU         2
7      a         MFM         1
8      a         MFM         2
9      b         SLS         1
10     b         SLS         2
11     b        COMM         1
12     b         MFM         1
13     b         MFM         2

2nd Part 第二部分

Q2=Q1%>%group_by(cid)%>%
        summarise(InteractionTuple=paste0(Interaction,replicate,collapse = ";"))
Q2
    # A tibble: 2 × 2
        cid                        InteractionTuple
      <chr>                                   <chr>
    1     a VCS1;VCS2;VCS3;SLS1;TCU1;TCU2;MFM1;MFM2
    2     b               SLS1;SLS2;COMM1;MFM1;MFM2

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM