R data frame group by a column and create repetition number of another column
I have a data frame like this:
ii <- data.frame(cid = c(rep('a', 8), rep('b', 5)),
                 Interaction = c(rep('VCS', 3), 'SLS', rep('TCU', 2), rep('MFM', 2),
                                 rep('SLS', 2), 'COMM', rep('MFM', 2)),
                 stringsAsFactors = FALSE
)
cid Interaction
1 a VCS
2 a VCS
3 a VCS
4 a SLS
5 a TCU
6 a TCU
7 a MFM
8 a MFM
9 b SLS
10 b SLS
11 b COMM
12 b MFM
13 b MFM
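I want to first group by cid, then create another column that numbers each repeated value in the Interaction column within its group. The result should look like this: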
cid Interaction replicate
1 a VCS 1
2 a VCS 2
3 a VCS 3
4 a SLS 1
5 a TCU 1
6 a TCU 2
7 a MFM 1
8 a MFM 2
9 b SLS 1
10 b SLS 2
11 b COMM 1
12 b MFM 1
13 b MFM 2
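Ultimately I want to reshape this into wide format (which I can't do with the current format, because I would lose the duplicates), something like: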
cid InteractionTuple
1 a VCS1;VCS2;VCS3;SLS1;TCU1;TCU2;MFM1;MFM2
2 b SLS1;SLS2;COMM;MFM1;MFM2
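The goal is to be able to run association rule mining techniques, which do not currently support repeated items within a transaction.
Using dplyr: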
library(dplyr)
ii %>%
group_by(cid, Interaction) %>%
mutate(Interaction_rn = paste0(Interaction, row_number())) %>%
group_by(cid) %>%
summarise(InteractionTuple = paste(Interaction_rn, collapse = ";"))
# # A tibble: 2 x 2
# cid InteractionTuple
# <chr> <chr>
# 1 a VCS1;VCS2;VCS3;SLS1;TCU1;TCU2;MFM1;MFM2
# 2 b SLS1;SLS2;COMM1;MFM1;MFM2
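For comparison, the same two steps can be sketched in base R, without extra packages. This is only an illustrative alternative, not part of the answer above: it assumes the rows are already in the desired order, and the names item, tuples and result are made up for the example.

```r
# Rebuild the example data from the question
ii <- data.frame(
  cid = c(rep("a", 8), rep("b", 5)),
  Interaction = c(rep("VCS", 3), "SLS", rep("TCU", 2), rep("MFM", 2),
                  rep("SLS", 2), "COMM", rep("MFM", 2)),
  stringsAsFactors = FALSE
)

# Running counter within each (cid, Interaction) group, in original row order
ii$replicate <- ave(seq_len(nrow(ii)), ii$cid, ii$Interaction, FUN = seq_along)

# Paste item and counter together, then collapse to one string per cid
item <- paste0(ii$Interaction, ii$replicate)
tuples <- tapply(item, ii$cid, paste, collapse = ";")

result <- data.frame(cid = names(tuples),
                     InteractionTuple = as.vector(tuples),
                     stringsAsFactors = FALSE)
print(result)
```

ave() applies seq_along() to each group and writes the counters back in the original row positions, so no sorting or re-joining is needed.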
Here is a data.table solution:
library(data.table)
dt <- as.data.table(ii)  # convert the question's data frame ii to a data.table
dt[ , "replicate" := 1:.N, by = .(Interaction, cid)]
cid Interaction replicate
1: a VCS 1
2: a VCS 2
3: a VCS 3
4: a SLS 1
5: a TCU 1
6: a TCU 2
7: a MFM 1
8: a MFM 2
9: b SLS 1
10: b SLS 2
11: b COMM 1
12: b MFM 1
13: b MFM 2
Edit, part 2:
dt2 = dt[ , .("InteractionTuple" = paste(Interaction, replicate, sep = "", collapse = ";")), by = .(cid)]
> dt2
cid InteractionTuple
1: a VCS1;VCS2;VCS3;SLS1;TCU1;TCU2;MFM1;MFM2
2: b SLS1;SLS2;COMM1;MFM1;MFM2
EDIT2
@MikeH suggested a different approach that might be faster. Here are the results:
library(microbenchmark)
microbenchmark(dt2 = dt[ , .("replicate" = 1:.N), by = .(Interaction, cid)],
               dt3 = dt[ , .("replicate" = seq_len(.N)), by = .(Interaction, cid)],
               times = 1000L)
Unit: microseconds
expr min lq mean median uq max neval
dt2 323.960 364.361 434.6370 402.8740 457.6220 2382.88 1000
dt3 318.296 360.585 508.1313 397.3985 461.5865 42750.25 1000
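With seq_len(.N), the median is slightly better (about 397 vs. 403 microseconds). The gap is small enough to be noise, and the mean for seq_len(.N) is actually higher, skewed by at least one slow run (max about 42750 microseconds).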
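Speed aside, seq_len() is often preferred over 1:n because the two behave differently when n is zero. As far as I know, groups formed with by always contain at least one row, so both forms should give the same result in the call above. The sketch below only illustrates the general difference, and the variable names are hypothetical.

```r
n <- 0L

# 1:n counts down when n is 0, giving two elements instead of none
countdown <- 1:n
print(countdown)

# seq_len(n) returns an empty integer vector for n = 0
empty <- seq_len(n)
print(length(empty))

# For n >= 1 the two forms agree
stopifnot(identical(1:3, seq_len(3L)))
```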
This answer is based on dplyr.
Part 1
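library(dplyr)
# rank() with ties.method = "first" numbers identical values in order of appearance
Q1 <- ii %>%
  group_by(cid, Interaction) %>%
  mutate(replicate = rank(Interaction, ties.method = "first"))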
Q1
cid Interaction replicate
<chr> <chr> <int>
1 a VCS 1
2 a VCS 2
3 a VCS 3
4 a SLS 1
5 a TCU 1
6 a TCU 2
7 a MFM 1
8 a MFM 2
9 b SLS 1
10 b SLS 2
11 b COMM 1
12 b MFM 1
13 b MFM 2
Part 2
Q2 <- Q1 %>%
  group_by(cid) %>%
  summarise(InteractionTuple = paste0(Interaction, replicate, collapse = ";"))
Q2
# A tibble: 2 × 2
cid InteractionTuple
<chr> <chr>
1 a VCS1;VCS2;VCS3;SLS1;TCU1;TCU2;MFM1;MFM2
2 b SLS1;SLS2;COMM1;MFM1;MFM2
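Since the stated goal is association rule mining, it may help to confirm that the numbered items really are unique within each cid. The following is a minimal base R sketch under that assumption. The tuples vector is typed in by hand from the output above, and converting the item lists into any particular mining package's transaction format is not shown.

```r
# Tuple strings as produced above, keyed by cid (entered by hand for this sketch)
tuples <- c(a = "VCS1;VCS2;VCS3;SLS1;TCU1;TCU2;MFM1;MFM2",
            b = "SLS1;SLS2;COMM1;MFM1;MFM2")

# Split each string back into a character vector of items; names are kept
items <- strsplit(tuples, ";", fixed = TRUE)

# TRUE for any cid whose item list still contains a repeated item
has_dups <- vapply(items, function(x) anyDuplicated(x) > 0, logical(1))
print(has_dups)
```

Each element of items is one transaction, so a list in this shape is a reasonable starting point for whatever mining tool is used afterwards.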