How to find the frequency of each observation in a column and sort the results in R?
I have a column in which each row is a string. I want to:
1. Find the frequency of each sequence.
2. Sort the result by frequency from high to low.
3. If several strings have the same frequency, sort them alphabetically by sequence.
My data looks like this:
ID seq
1 1 BBBBBBIRBBRBBBB
2 2 BBBBBBIRRRRRBBB
3 3 BBBBBBIRRRRRRRR
4 4 BBBBBBITBBBBBBB
5 5 BBBBBBITBBBRBBX
6 6 BBBBBBITTTTBBCX
7 7 BBBBBBITTTTTTTT
8 8 BBBBBBOBBBBBBTX
9 9 BBBBBBOBBBBBBXB
10 10 BBBBBBIRBBRBBBB
11 11 BBBBBBIRRRRRBBB
12 12 BBBBBBIRRRRRRRR
13 13 BBBBBBITBBBBBBB
14 14 BBBBBBITBBBRBBX
15 15 BBBBBBIRBBRBBBB
16 16 BBBBBBIRRRRRBBB
17 17 BBBBBBIRRRRRRRR
18 18 BBBBBBIRBBRBBBB
19 19 BBBBBBIRRRRRBBB
20 20 BBBBBBIRRRRRBBB
ID<-c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20)
seq<-c('BBBBBBIRBBRBBBB','BBBBBBIRBBRBBBB', 'BBBBBBIRRRRRBBB', 'BBBBBBIRRRRRRRR', 'BBBBBBITBBBBBBB', 'BBBBBBITBBBRBBX', 'BBBBBBITTTTBBCX', 'BBBBBBITTTTTTTT', 'BBBBBBOBBBBBBTX', 'BBBBBBOBBBBBBXB', 'BBBBBBIRBBRBBBB', 'BBBBBBIRRRRRBBB', 'BBBBBBIRRRRRRRR', 'BBBBBBITBBBBBBB', 'BBBBBBITBBBRBBX', 'BBBBBBIRBBRBBBB', 'BBBBBBIRRRRRBBB', 'BBBBBBIRRRRRRRR', 'BBBBBBIRBBRBBBB', 'BBBBBBIRRRRRBBB')
df <- data.frame(ID, seq)
I want the result to look like this:
sequence Frequency
BBBBBBIRBBRBBBB 5
BBBBBBIRRRRRBBB 4
BBBBBBIRRRRRRRR 3
BBBBBBITBBBBBBB 2
BBBBBBITBBBRBBX 1
BBBBBBITTTTBBCX 1
BBBBBBITTTTTTTT 1
BBBBBBOBBBBBBTX 1
BBBBBBOBBBBBBXB 1
Thanks in advance!!
You can do this with data.table:
library(data.table)
setDT(df)[, .N, by = seq][order(-N)]
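The one-liner above sorts by count only, so rows with equal counts stay in whatever order the grouping produced. To also break ties alphabetically, as the question asks, a second key can be passed to order(). A minimal sketch on made-up toy data (the values "a", "b", "c" are illustrative, not the asker's sequences):

```r
library(data.table)

# Hypothetical toy data: "c" appears 3 times, "a" and "b" tie at 2 each.
dt <- data.table(ID = 1:7, seq = c("b", "c", "a", "c", "b", "a", "c"))

# Count rows per sequence, then sort by count (descending)
# and by sequence (ascending) to break ties.
res <- dt[, .N, by = seq][order(-N, seq)]
print(res)
```

Here "a" is listed before "b" even though "b" appears first in the data, because seq is the second sort key.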
It is worth noting that data.table consistently beats dplyr in terms of speed across different sample sizes:
The number at the top of each plot is how many times the original sample was repeated. Here is the code to reproduce the benchmark:
library(data.table)
library(dplyr)
library(microbenchmark)

# data.table approach: count rows per sequence, sort by count (descending)
dtWay <- function(ID, seq) {
  dt <- data.table(ID, seq)
  setkey(dt, seq)
  return(dt[, .N, by = seq][order(-N)])
}

# dplyr approach: same counts, with renamed columns
dplyrWay <- function(ID, seq) {
  df <- data.frame(ID, seq)
  res <- df %>%
    dplyr::group_by(seq) %>%
    dplyr::summarize(frequency = length(ID)) %>%
    dplyr::arrange(desc(frequency)) %>%
    dplyr::rename(sequence = seq)
  return(res)
}

# Repeat the 19-row sample 10^3, 10^4, 10^5 and 10^6 times
N <- c(3, 4, 5, 6)
n <- 10^N

# One boxplot per sample size, laid out in a 2 x 2 grid
par(mfrow = c(2, 2))
res <- lapply(n, function(x) {
  ID <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19)
  ID <- rep(ID, times = x)
  seq <- c('BBBBBBIRBBRBBBB', 'BBBBBBIRRRRRBBB', 'BBBBBBIRRRRRRRR', 'BBBBBBITBBBBBBB', 'BBBBBBITBBBRBBX', 'BBBBBBITTTTBBCX', 'BBBBBBITTTTTTTT', 'BBBBBBOBBBBBBTX', 'BBBBBBOBBBBBBXB', 'BBBBBBIRBBRBBBB', 'BBBBBBIRRRRRBBB', 'BBBBBBIRRRRRRRR', 'BBBBBBITBBBBBBB', 'BBBBBBITBBBRBBX', 'BBBBBBIRBBRBBBB', 'BBBBBBIRRRRRBBB', 'BBBBBBIRRRRRRRR', 'BBBBBBIRBBRBBBB', 'BBBBBBIRRRRRBBB')
  seq <- rep(seq, times = x)
  m <- microbenchmark("data.table" = dtWay(ID, seq),
                      "dplyr" = dplyrWay(ID, seq),
                      times = 10, unit = "s")
  a <- boxplot(m, main = x, xlab = "", ylab = "time")
})
If you want more control over the sorting and the column names, you could use the following dplyr functions.
library(dplyr)
# assumes df is a data frame with seq and ID columns
df %>%
  group_by(sequence = seq) %>%
  summarize(frequency = length(ID)) %>%
  arrange(-frequency)
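To make the alphabetical tie-break from the question explicit, a second sort key can be given to arrange(). A minimal sketch on made-up toy data (the values "a", "b", "c" are illustrative, not the asker's sequences); n() is used here in place of length(ID) and counts the rows in each group:

```r
library(dplyr)

# Hypothetical toy data: "c" appears 3 times, "a" and "b" tie at 2 each.
df <- data.frame(ID = 1:7,
                 seq = c("b", "c", "a", "c", "b", "a", "c"),
                 stringsAsFactors = FALSE)

res <- df %>%
  group_by(sequence = seq) %>%          # group and rename in one step
  summarize(frequency = n()) %>%        # rows per group
  arrange(desc(frequency), sequence)    # count descending, then alphabetical
print(res)
```

Spelling out both keys means the result does not depend on the order in which the groups happen to come out of summarize().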
I like dplyr.
install.packages('dplyr')  # only needed once
library(dplyr)
# count() groups by seq itself, so a single call is enough
count(df, seq)
Source: local data frame [9 x 2]
seq n
(fctr) (int)
1 BBBBBBIRBBRBBBB 4
2 BBBBBBIRRRRRBBB 4
3 BBBBBBIRRRRRRRR 3
4 BBBBBBITBBBBBBB 2
5 BBBBBBITBBBRBBX 2
6 BBBBBBITTTTBBCX 1
7 BBBBBBITTTTTTTT 1
8 BBBBBBOBBBBBBTX 1
9 BBBBBBOBBBBBBXB 1
That looks like your desired output, no? I'm not sure why there are only 4 counts of the first sequence, though.
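One way to cross-check counts like these without any packages is base R's table(). This is a swapped-in technique, not the method used in this answer. A minimal sketch on made-up toy data (the vector seq_col and its values are illustrative, not the asker's sequences):

```r
# Hypothetical toy data: "c" appears 3 times, "a" and "b" tie at 2 each.
seq_col <- c("b", "c", "a", "c", "b", "a", "c")

# table() tallies each distinct value; as.data.frame() turns the tally
# into two columns, the value and its count (named "Freq" by default).
tab <- as.data.frame(table(sequence = seq_col), stringsAsFactors = FALSE)
names(tab)[2] <- "Frequency"

# Sort by count (descending), then alphabetically to break ties.
tab <- tab[order(-tab$Frequency, tab$sequence), ]
print(tab)
```

Running the same steps on the real seq column shows how many times each sequence actually occurs in the data that was loaded.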