How to know the frequency of each observation in a column and sort them in R?

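I have a column in which every row is a string. I want to:
1. find the frequency of each sequence;
2. sort the results by frequency, from highest to lowest;
3. if several strings have the same frequency, sort them alphabetically by sequence.

My data looks like this: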

   ID             seq
1   1 BBBBBBIRBBRBBBB
2   2 BBBBBBIRRRRRBBB
3   3 BBBBBBIRRRRRRRR
4   4 BBBBBBITBBBBBBB
5   5 BBBBBBITBBBRBBX
6   6 BBBBBBITTTTBBCX
7   7 BBBBBBITTTTTTTT
8   8 BBBBBBOBBBBBBTX
9   9 BBBBBBOBBBBBBXB
10 10 BBBBBBIRBBRBBBB
11 11 BBBBBBIRRRRRBBB
12 12 BBBBBBIRRRRRRRR
13 13 BBBBBBITBBBBBBB
14 14 BBBBBBITBBBRBBX
15 15 BBBBBBIRBBRBBBB
16 16 BBBBBBIRRRRRBBB
17 17 BBBBBBIRRRRRRRR
18 18 BBBBBBIRBBRBBBB
19 19 BBBBBBIRRRRRBBB
20 20 BBBBBBIRRRRRBBB

ID<-c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20)
seq<-c('BBBBBBIRBBRBBBB','BBBBBBIRBBRBBBB',  'BBBBBBIRRRRRBBB', 'BBBBBBIRRRRRRRR',  'BBBBBBITBBBBBBB',  'BBBBBBITBBBRBBX',  'BBBBBBITTTTBBCX',  'BBBBBBITTTTTTTT',  'BBBBBBOBBBBBBTX',  'BBBBBBOBBBBBBXB',  'BBBBBBIRBBRBBBB',  'BBBBBBIRRRRRBBB',  'BBBBBBIRRRRRRRR',  'BBBBBBITBBBBBBB',  'BBBBBBITBBBRBBX',  'BBBBBBIRBBRBBBB',  'BBBBBBIRRRRRBBB',  'BBBBBBIRRRRRRRR',  'BBBBBBIRBBRBBBB',  'BBBBBBIRRRRRBBB')
df <- data.frame(ID, seq)

I would like the result to look like this:

sequence        Frequency
BBBBBBIRBBRBBBB 5
BBBBBBIRRRRRBBB 4
BBBBBBIRRRRRRRR 3
BBBBBBITBBBBBBB 2
BBBBBBITBBBRBBX 1
BBBBBBITTTTBBCX 1
BBBBBBITTTTTTTT 1
BBBBBBOBBBBBBTX 1
BBBBBBOBBBBBBXB 1

Thanks in advance!

You can do this with data.table:

library(data.table)

setDT(df)[, .N, by = seq][order(-N)]
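
The one-liner above orders by count only, so, as far as I can tell, sequences with equal counts stay in the order in which they first appear. Below is a minimal sketch of one way to also break ties alphabetically, as the question asks. It assumes the 20-value sample vector from the question's code. The names `dt` and `freq` and the column labels `sequence` and `Frequency` are illustrative choices, not part of the original answer.

```r
library(data.table)

# Sample sequences copied from the question's code (20 values).
seq <- c("BBBBBBIRBBRBBBB", "BBBBBBIRBBRBBBB", "BBBBBBIRRRRRBBB", "BBBBBBIRRRRRRRR",
         "BBBBBBITBBBBBBB", "BBBBBBITBBBRBBX", "BBBBBBITTTTBBCX", "BBBBBBITTTTTTTT",
         "BBBBBBOBBBBBBTX", "BBBBBBOBBBBBBXB", "BBBBBBIRBBRBBBB", "BBBBBBIRRRRRBBB",
         "BBBBBBIRRRRRRRR", "BBBBBBITBBBBBBB", "BBBBBBITBBBRBBX", "BBBBBBIRBBRBBBB",
         "BBBBBBIRRRRRBBB", "BBBBBBIRRRRRRRR", "BBBBBBIRBBRBBBB", "BBBBBBIRRRRRBBB")
dt <- data.table(ID = seq_along(seq), seq = seq)

# Count rows per sequence, then sort by count (descending) and,
# within equal counts, by sequence (ascending).
freq <- dt[, .N, by = seq][order(-N, seq)]

# Rename the columns to match the layout of the desired output (illustrative names).
setnames(freq, c("seq", "N"), c("sequence", "Frequency"))
print(freq)
```

Adding `seq` as a second key to `order()` is what handles the ties. The grouping step itself is unchanged from the answer.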

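It is worth noting that, in the benchmark below, data.table was consistently faster than dplyr across the sample sizes tested.

[Figure: box plots comparing data.table and dplyr timings, one panel per sample size]

The number at the top of each panel is the number of times the original sample was repeated.

Here is the code to reproduce it: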

library(data.table)
library(dplyr)
library(microbenchmark)

# data.table approach: count rows per sequence, sort by count descending
dtWay <- function(ID, seq) {
  dt <- data.table(ID, seq)
  setkey(dt, seq)
  dt[, .N, by = seq][order(-N)]
}

# dplyr approach: same count, with the columns renamed
dplyrWay <- function(ID, seq) {
  df <- data.frame(ID, seq)
  df %>%
    dplyr::group_by(seq) %>%
    dplyr::summarize(frequency = length(ID)) %>%
    dplyr::arrange(desc(frequency)) %>%
    dplyr::rename(sequence = seq)
}

# The 19-row sample is repeated 10^3, 10^4, 10^5 and 10^6 times
N <- c(3, 4, 5, 6)
n <- 10^N

# Close any open graphics device, then draw one panel per sample size
if (dev.cur() > 1) dev.off()
par(mfrow = c(2, 2))
res <- lapply(n, function(x) {

  ID <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19)
  ID <- rep(ID, times = x)
  seq <- c('BBBBBBIRBBRBBBB',  'BBBBBBIRRRRRBBB',   'BBBBBBIRRRRRRRR',  'BBBBBBITBBBBBBB',  'BBBBBBITBBBRBBX',  'BBBBBBITTTTBBCX',  'BBBBBBITTTTTTTT',  'BBBBBBOBBBBBBTX',  'BBBBBBOBBBBBBXB',  'BBBBBBIRBBRBBBB',  'BBBBBBIRRRRRBBB',  'BBBBBBIRRRRRRRR',  'BBBBBBITBBBBBBB',  'BBBBBBITBBBRBBX',  'BBBBBBIRBBRBBBB',  'BBBBBBIRRRRRBBB',  'BBBBBBIRRRRRRRR',  'BBBBBBIRBBRBBBB',  'BBBBBBIRRRRRBBB')
  seq <- rep(seq, times = x)

  # Time both approaches, 10 runs each, reported in seconds
  m <- microbenchmark("data.table" = dtWay(ID, seq),
                      "dplyr" = dplyrWay(ID, seq),
                      times = 10, unit = "s")

  boxplot(m, main = x, xlab = "", ylab = "time")
})

If you want more control over the ordering and the column names, you can use the following dplyr pipeline.

library(dplyr)
# assumes df is a data frame with seq and ID columns
df %>% 
  group_by(sequence = seq) %>% 
  summarize(frequency = length(ID)) %>% 
  arrange(-frequency)
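
The pipeline above sorts by frequency only. Below is a minimal sketch of how the same pipeline could also sort ties alphabetically, as the question asks. It assumes the 20-value sample vector from the question's code. The name `freq` is illustrative, and `n()` is used in place of `length(ID)` as an equivalent way to count rows per group.

```r
library(dplyr)

# Sample sequences copied from the question's code (20 values).
seq <- c("BBBBBBIRBBRBBBB", "BBBBBBIRBBRBBBB", "BBBBBBIRRRRRBBB", "BBBBBBIRRRRRRRR",
         "BBBBBBITBBBBBBB", "BBBBBBITBBBRBBX", "BBBBBBITTTTBBCX", "BBBBBBITTTTTTTT",
         "BBBBBBOBBBBBBTX", "BBBBBBOBBBBBBXB", "BBBBBBIRBBRBBBB", "BBBBBBIRRRRRBBB",
         "BBBBBBIRRRRRRRR", "BBBBBBITBBBBBBB", "BBBBBBITBBBRBBX", "BBBBBBIRBBRBBBB",
         "BBBBBBIRRRRRBBB", "BBBBBBIRRRRRRRR", "BBBBBBIRBBRBBBB", "BBBBBBIRRRRRBBB")
df <- data.frame(ID = seq_along(seq), seq = seq, stringsAsFactors = FALSE)

freq <- df %>%
  group_by(sequence = seq) %>%            # group and rename in one step
  summarize(frequency = n()) %>%          # rows per sequence
  arrange(desc(frequency), sequence)      # count descending, then alphabetical
print(freq)
```

The second argument to `arrange()` is only consulted when two rows have the same `frequency`, which is the tie-breaking rule the question describes.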

I like dplyr:

install.packages('dplyr')
library(dplyr)

# count() groups by seq itself, so no separate group_by() is needed;
# df is left unmodified so the call below counts the original rows
count(df, seq)
Source: local data frame [9 x 2]

          seq     n
       (fctr) (int)
 1 BBBBBBIRBBRBBBB     4
 2 BBBBBBIRRRRRBBB     4
 3 BBBBBBIRRRRRRRR     3
 4 BBBBBBITBBBBBBB     2
 5 BBBBBBITBBBRBBX     2
 6 BBBBBBITTTTBBCX     1
 7 BBBBBBITTTTTTTT     1
 8 BBBBBBOBBBBBBTX     1
 9 BBBBBBOBBBBBBXB     1

That looks like the output you wanted, doesn't it? I'm not sure why the first sequence has a count of only 4.
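
One way to double-check a count like this is to tabulate the vector directly. Below is a minimal base R sketch, assuming the 20-value `seq` vector from the question's code (the names `counts`, `ord` and `result` are illustrative). With that input the first sequence occurs 5 times. The 19-value vector in the benchmark answer contains it only 4 times, so a count of 4 may simply mean this answer was run on slightly different input.

```r
# Sample sequences copied from the question's code (20 values).
seq <- c("BBBBBBIRBBRBBBB", "BBBBBBIRBBRBBBB", "BBBBBBIRRRRRBBB", "BBBBBBIRRRRRRRR",
         "BBBBBBITBBBBBBB", "BBBBBBITBBBRBBX", "BBBBBBITTTTBBCX", "BBBBBBITTTTTTTT",
         "BBBBBBOBBBBBBTX", "BBBBBBOBBBBBBXB", "BBBBBBIRBBRBBBB", "BBBBBBIRRRRRBBB",
         "BBBBBBIRRRRRRRR", "BBBBBBITBBBBBBB", "BBBBBBITBBBRBBX", "BBBBBBIRBBRBBBB",
         "BBBBBBIRRRRRBBB", "BBBBBBIRRRRRRRR", "BBBBBBIRBBRBBBB", "BBBBBBIRRRRRBBB")

# Tabulate the occurrences of each distinct sequence.
counts <- table(seq)

# Order by count (descending), then by sequence name (ascending) for ties.
ord <- order(-as.integer(counts), names(counts))

result <- data.frame(sequence  = names(counts)[ord],
                     Frequency = as.integer(counts)[ord],
                     stringsAsFactors = FALSE)
print(result)
```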

