R data frame: group by a column and create a repetition number for another column
I have a data frame like this:
ii <- data.frame(
  cid = c(rep('a', 8), rep('b', 5)),
  Interaction = c(rep('VCS', 3), 'SLS', rep('TCU', 2), rep('MFM', 2),
                  rep('SLS', 2), 'COMM', rep('MFM', 2)),
  stringsAsFactors = FALSE
)
cid Interaction
1 a VCS
2 a VCS
3 a VCS
4 a SLS
5 a TCU
6 a TCU
7 a MFM
8 a MFM
9 b SLS
10 b SLS
11 b COMM
12 b MFM
13 b MFM
I would like to first group by cid and then create another column that shows the repetition number of each value in the Interaction column. The result should look like this:
cid Interaction replicate
1 a VCS 1
2 a VCS 2
3 a VCS 3
4 a SLS 1
5 a TCU 1
6 a TCU 2
7 a MFM 1
8 a MFM 2
9 b SLS 1
10 b SLS 2
11 b COMM 1
12 b MFM 1
13 b MFM 2
Eventually I also want to reshape this to a wide format (I couldn't do it with the current format because I lose duplicates) that would resemble something like:
cid InteractionTuple
1 a VCS1;VCS2;VCS3;SLS1;TCU1;TCU2;MFM1;MFM2
2 b SLS1;SLS2;COMM1;MFM1;MFM2
This is so that I can run association rule mining techniques that currently support repeated items per transaction.
Using dplyr:
library(dplyr)
ii %>%
  group_by(cid, Interaction) %>%
  mutate(Interaction_rn = paste0(Interaction, row_number())) %>%
  group_by(cid) %>%
  summarise(InteractionTuple = paste(Interaction_rn, collapse = ";"))
# # A tibble: 2 x 2
# cid InteractionTuple
# <chr> <chr>
# 1 a VCS1;VCS2;VCS3;SLS1;TCU1;TCU2;MFM1;MFM2
# 2 b SLS1;SLS2;COMM1;MFM1;MFM2
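For comparison, the same two steps can be sketched in base R without any packages. This is only an illustration, not part of the answer above: it assumes the question's data frame ii, uses ave() with seq_along to number rows within each cid/Interaction combination, and tapply() to collapse the numbered items per cid. The names item and tuples are made up for the example.

```r
# Rebuild the question's data so the sketch is self-contained
ii <- data.frame(
  cid = c(rep("a", 8), rep("b", 5)),
  Interaction = c(rep("VCS", 3), "SLS", rep("TCU", 2), rep("MFM", 2),
                  rep("SLS", 2), "COMM", rep("MFM", 2)),
  stringsAsFactors = FALSE
)

# ave() applies seq_along() within each cid/Interaction combination
# and returns the results in the original row order
ii$replicate <- ave(seq_len(nrow(ii)), ii$cid, ii$Interaction, FUN = seq_along)

# Glue each interaction to its running number, then collapse per cid
item <- paste0(ii$Interaction, ii$replicate)
tuples <- tapply(item, ii$cid, paste, collapse = ";")

print(ii)
print(tuples)
```

tapply() returns a named character array here, so tuples[["a"]] gives the string for cid a.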
Here's a data.table solution:
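library(data.table)
# The question's data frame is named ii; convert it to a data.table called dt
dt <- as.data.table(ii)
# Number the rows within each (Interaction, cid) group
dt[, replicate := 1:.N, by = .(Interaction, cid)]
dt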
cid Interaction replicate
1: a VCS 1
2: a VCS 2
3: a VCS 3
4: a SLS 1
5: a TCU 1
6: a TCU 2
7: a MFM 1
8: a MFM 2
9: b SLS 1
10: b SLS 2
11: b COMM 1
12: b MFM 1
13: b MFM 2
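As a side note, data.table also ships a helper, rowid(), that generates a running id within each combination of the vectors passed to it. A minimal sketch, assuming a reasonably recent data.table version and the same data as in the question (rebuilt here so the snippet stands alone):

```r
library(data.table)

# Same data as in the question, built directly as a data.table
dt <- data.table(
  cid = c(rep("a", 8), rep("b", 5)),
  Interaction = c(rep("VCS", 3), "SLS", rep("TCU", 2), rep("MFM", 2),
                  rep("SLS", 2), "COMM", rep("MFM", 2))
)

# rowid() numbers rows within each (cid, Interaction) combination,
# so no explicit by = is needed
dt[, replicate := rowid(cid, Interaction)]
print(dt)
```

This should give the same replicate column as the grouped 1:.N version, just written without by.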
Edit, part 2:
# Paste each interaction to its number, then collapse into one string per cid
dt2 <- dt[, .(InteractionTuple = paste(Interaction, replicate, sep = "", collapse = ";")), by = .(cid)]
dt2
cid InteractionTuple
1: a VCS1;VCS2;VCS3;SLS1;TCU1;TCU2;MFM1;MFM2
2: b SLS1;SLS2;COMM1;MFM1;MFM2
Edit 2:
@MikeH suggested a different way which might be faster. Here are the results:
library(microbenchmark)
microbenchmark(dt2 = dt[, .(replicate = 1:.N), by = .(Interaction, cid)],
               dt3 = dt[, .(replicate = seq_len(.N)), by = .(Interaction, cid)],
               times = 1000L)
Unit: microseconds
expr min lq mean median uq max neval
dt2 323.960 364.361 434.6370 402.8740 457.6220 2382.88 1000
dt3 318.296 360.585 508.1313 397.3985 461.5865 42750.25 1000
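The median is slightly better using seq_len(.N) (about 397 versus 403 microseconds). The difference is small, and the mean for seq_len(.N) is actually higher, most likely because of the single slow run visible in the max column.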
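Speed aside, the usual reason to prefer seq_len(n) over 1:n in R is the empty case. As far as I know this does not matter inside a grouped data.table call, where every group has at least one row, but it does matter in general code. A small base R illustration (the variable names are made up for the example):

```r
n <- 0L

# 1:n counts down when n is 0, so it yields c(1, 0) rather than nothing
colon_result <- 1:n

# seq_len(n) yields an empty integer vector when n is 0
seq_result <- seq_len(n)

print(length(colon_result))
print(length(seq_result))
```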
This answer is based on dplyr.
1st Part
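library(dplyr)
Q1 <- ii %>%
  group_by(cid, Interaction) %>%
  # All values within a group tie, so ties.method = "first" numbers them in row order
  mutate(replicate = rank(Interaction, ties.method = "first"))
Q1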
cid Interaction replicate
<chr> <chr> <int>
1 a VCS 1
2 a VCS 2
3 a VCS 3
4 a SLS 1
5 a TCU 1
6 a TCU 2
7 a MFM 1
8 a MFM 2
9 b SLS 1
10 b SLS 2
11 b COMM 1
12 b MFM 1
13 b MFM 2
2nd Part
Q2 <- Q1 %>%
  group_by(cid) %>%
  summarise(InteractionTuple = paste0(Interaction, replicate, collapse = ";"))
Q2
# A tibble: 2 × 2
cid InteractionTuple
<chr> <chr>
1 a VCS1;VCS2;VCS3;SLS1;TCU1;TCU2;MFM1;MFM2
2 b SLS1;SLS2;COMM1;MFM1;MFM2