R - Find all sequences and their frequencies in a data frame

I have this data.frame:

10  34  35  39  55  43
10  32  33  40  45  48
10  35  36  38  41  43
30  31  32  34  36  49
39  55  40  43  45  50
30  32  35  36  49  50
 2   8   9  39  55  43
 1   2   8  12  55  43
 2   8  12  55  43  61
 2   8  55  43  61  78

I'd like to find all sequences (of length > 2) across all rows and group them by frequency (keeping only frequency > 1). In this case, the output should be:

sequence               frequency
[39  55  43]           3
[10  35  43]           2
[32  36  49]           2
[30  32  36]           2
[30  32  36  49]       2
[ 2   8  55]           4
[ 2   8  55  43]       4
[ 2   8  55  43  61]   2

Is it possible to do this in R?

You can write a function, subseqs, that enumerates all order-preserving subsequences of length 3 or more in each row, then tally their frequencies with table:

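# For one row v: every subsequence of length 3..length(v), each collapsed to a string
subseqs <- function(v) sapply(3:length(v), function(k) combn(v, k, FUN = toString))

# Apply to every row, flatten, and count how often each subsequence string occurs
f <- table(unlist(apply(df, 1, subseqs)), dnn = "sequence")

# Keep only the subsequences that occur in at least two rows
dfout <- data.frame(f[f >= 2])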
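
To see what the combn(v, k, FUN = toString) call contributes, here is a minimal sketch on a toy vector (the vector v below is illustrative and not part of the question's data). combn picks elements in their original order, so each result is an order-preserving subsequence, and toString collapses it into a single string that table can count.

```r
# Toy vector, chosen only for illustration
v <- c(2, 8, 55, 43)

# All order-preserving subsequences of length 3, one string per subsequence
three <- combn(v, 3, FUN = toString)
print(three)
# "2, 8, 55"  "2, 8, 43"  "2, 55, 43" "8, 55, 43"

# A row of 6 values yields choose(6, 3) + ... + choose(6, 6) subsequences
n_per_row <- sum(choose(6, 3:6))
print(n_per_row)
# 42
```

Because every row of six values produces 42 strings, this approach grows quickly with row length; it should be fine for narrow data frames like the one in the question.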

which gives every subsequence that appears in at least two rows:

> dfout
           sequence Freq
1        10, 35, 43    2
2        12, 55, 43    2
3         2, 12, 43    2
4         2, 12, 55    2
5     2, 12, 55, 43    2
6         2, 43, 61    2
7         2, 55, 43    4
8     2, 55, 43, 61    2
9         2, 55, 61    2
10         2, 8, 12    2
11     2, 8, 12, 43    2
12     2, 8, 12, 55    2
13 2, 8, 12, 55, 43    2
14         2, 8, 43    4
15     2, 8, 43, 61    2
16         2, 8, 55    4
17     2, 8, 55, 43    4
18 2, 8, 55, 43, 61    2
19     2, 8, 55, 61    2
20         2, 8, 61    2
21       30, 32, 36    2
22   30, 32, 36, 49    2
23       30, 32, 49    2
24       30, 36, 49    2
25       32, 36, 49    2
26       39, 55, 43    3
27       55, 43, 61    2
28        8, 12, 43    2
29        8, 12, 55    2
30    8, 12, 55, 43    2
31        8, 43, 61    2
32        8, 55, 43    4
33    8, 55, 43, 61    2
34        8, 55, 61    2
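
The table above is in alphabetical order of the sequence strings. If the result should instead be grouped by frequency, as the question asks, one possible follow-up is to sort and split on the Freq column. This is only a sketch: it rebuilds the data with a matrix() call for self-containment, and the names dfsorted and by_freq are introduced here for illustration.

```r
# Data from the question, entered row by row
df <- as.data.frame(matrix(c(
  10, 34, 35, 39, 55, 43,
  10, 32, 33, 40, 45, 48,
  10, 35, 36, 38, 41, 43,
  30, 31, 32, 34, 36, 49,
  39, 55, 40, 43, 45, 50,
  30, 32, 35, 36, 49, 50,
   2,  8,  9, 39, 55, 43,
   1,  2,  8, 12, 55, 43,
   2,  8, 12, 55, 43, 61,
   2,  8, 55, 43, 61, 78
), ncol = 6, byrow = TRUE))

# Same steps as in the answer
subseqs <- function(v) sapply(3:length(v), function(k) combn(v, k, FUN = toString))
f <- table(unlist(apply(df, 1, subseqs)), dnn = "sequence")
dfout <- data.frame(f[f >= 2])

# Most frequent subsequences first (dfsorted is a hypothetical name)
dfsorted <- dfout[order(-dfout$Freq), ]

# One character vector of subsequences per frequency value
by_freq <- split(as.character(dfsorted$sequence), dfsorted$Freq)
print(by_freq[["4"]])
```

by_freq is a list keyed by frequency, so by_freq[["3"]] holds the single subsequence that occurs three times.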

DATA

df <- structure(list(V1 = c(10L, 10L, 10L, 30L, 39L, 30L, 2L, 1L, 2L, 
2L), V2 = c(34L, 32L, 35L, 31L, 55L, 32L, 8L, 2L, 8L, 8L), V3 = c(35L, 
33L, 36L, 32L, 40L, 35L, 9L, 8L, 12L, 55L), V4 = c(39L, 40L, 
38L, 34L, 43L, 36L, 39L, 12L, 55L, 43L), V5 = c(55L, 45L, 41L, 
36L, 45L, 49L, 55L, 55L, 43L, 61L), V6 = c(43L, 48L, 43L, 49L, 
50L, 50L, 43L, 43L, 61L, 78L)), class = "data.frame", row.names = c(NA, 
-10L))
