Matching and counting strings (k-mers of DNA) in R

Question

I have a list of strings (DNA sequences) over the alphabet A, T, C, G. I want to find all matches and put them into a table whose columns are all possible combinations of the DNA alphabet (4^k, where k is the length of each match, i.e. the k-mer size, and must be specified by the user) and whose rows hold the number of matches in each sequence of the list.

Let's say my list has 5 members:

DNAlst<-list("CAAACTGATTTT","GATGAAAGTAAAATACCG","ATTATGC","TGGA","CGCGCATCAA")

I want to set k=2 (2-mers), so 4^2=16 combinations are available, including AA, AT, AC, AG, TA, TT, ...

So my table will have 5 rows and 16 columns. I want to count the number of matches between my k-mers and the list members.

My desired result, df:

lstMemb AA AT AC AG TA TT TC ...
  1     2  1  1  0  0  3  0
  2       ...
  3
  4
  5

Could you help me implement this in R?

Answer 1

Maybe this helps:

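 # One-time install. biocLite() was the installer when this was written;
 # current Bioconductor releases use BiocManager::install("Biostrings").
 # source("http://bioconductor.org/biocLite.R")
 # biocLite("Biostrings")
 library(Biostrings)
 t(sapply(DNAlst, function(x) {
   x1 <- DNAString(x)
   oligonucleotideFrequency(x1, 2)
 }))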
  #     AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT
  #[1,]  2  1  0  1  1  0  0  1  1  0  0  0  0  0  1  3
  #[2,]  5  1  1  2  0  1  1  0  2  0  0  1  2  0  1  0
  #[3,]  0  0  0  2  0  0  0  0  0  1  0  0  1  0  1  1
  #[4,]  0  0  0  0  0  0  0  0  1  0  1  0  0  0  1  0
  #[5,]  1  0  0  1  2  0  2  0  0  2  0  0  0  1  0  0

Or, as suggested by @Arun, convert the list to a vector first:

   oligonucleotideFrequency(DNAStringSet(unlist(DNAlst)), 2L)
   #     AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT
   #[1,]  2  1  0  1  1  0  0  1  1  0  0  0  0  0  1  3
   #[2,]  5  1  1  2  0  1  1  0  2  0  0  1  2  0  1  0
   #[3,]  0  0  0  2  0  0  0  0  0  1  0  0  1  0  1  1
   #[4,]  0  0  0  0  0  0  0  0  1  0  1  0  0  0  1  0
   #[5,]  1  0  0  1  2  0  2  0  0  2  0  0  0  1  0  0
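
For readers without a Bioconductor installation, the same kind of table can be sketched in base R. This is only an illustration of the sliding-window count that the output above reflects, not how Biostrings implements it; `count_kmers` and its `alphabet` argument are hypothetical names introduced here.

```r
# Hypothetical helper (not part of Biostrings): count overlapping k-mers
# in each sequence, with one column per possible k-mer.
count_kmers <- function(seqs, k, alphabet = c("A", "C", "G", "T")) {
  seqs <- unlist(seqs)
  # All length(alphabet)^k k-mers, sorted alphabetically (AA, AC, AG, ...).
  grid <- expand.grid(rep(list(alphabet), k), stringsAsFactors = FALSE)
  all_kmers <- sort(do.call(paste0, grid))
  counts <- t(vapply(seqs, function(s) {
    n <- nchar(s) - k + 1
    # Every window of width k, shifted by one base at a time.
    found <- if (n > 0) substring(s, seq_len(n), seq_len(n) + k - 1) else character(0)
    # Fixed factor levels keep a zero count for k-mers that never occur.
    as.integer(table(factor(found, levels = all_kmers)))
  }, integer(length(all_kmers)), USE.NAMES = FALSE))
  colnames(counts) <- all_kmers
  counts
}

DNAlst <- list("CAAACTGATTTT", "GATGAAAGTAAAATACCG", "ATTATGC", "TGGA", "CGCGCATCAA")
kmer_counts <- count_kmers(DNAlst, 2)
kmer_counts
```

Because every one of the 4^k columns is materialised, this sketch is only practical for small k.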

Answer 2

If you are looking for speed, the obvious solution is the stringi package. Its stri_count_fixed function counts occurrences of fixed patterns. Here is the code and a benchmark:

library(stringi)
DNAlst<-list("CAAACTGATTTT","GATGAAAGTAAAATACCG","ATTATGC","TGGA","CGCGCATCAA")
# all 16 2-mers in alphabetical order: AA, AC, AG, AT, CA, ...
dna <- stri_paste(rep(c("A","C","G","T"),each=4),c("A","C","G","T"))
# overlap=TRUE counts overlapping matches, e.g. "TT" three times in "TTTT"
result <- t(sapply(DNAlst, stri_count_fixed,pattern=dna,overlap=TRUE))
colnames(result) <- dna
result
     AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT
[1,]  2  1  0  1  1  0  0  1  1  0  0  0  0  0  1  3
[2,]  5  1  1  2  0  1  1  0  2  0  0  1  2  0  1  0
[3,]  0  0  0  2  0  0  0  0  0  1  0  0  1  0  1  1
[4,]  0  0  0  0  0  0  0  0  1  0  1  0  0  0  1  0
[5,]  1  0  0  1  2  0  2  0  0  2  0  0  0  1  0  0



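library(Biostrings)
fstri <- function(x){
    # name the arguments so that TRUE cannot be taken for a different option
    t(sapply(x, stri_count_fixed, pattern=dna, overlap=TRUE))
}
fbio <- function(x){
    t(sapply(x, function(x){x1 <-  DNAString(x); oligonucleotideFrequency(x1,2)}))
}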

all(fstri(DNAlst)==fbio(DNAlst)) #results are the same
[1] TRUE

library(microbenchmark)
longDNA <- sample(DNAlst, 100, replace=TRUE)
microbenchmark(fstri(longDNA),fbio(longDNA))
Unit: microseconds
           expr        min         lq        mean     median         uq        max neval
 fstri(longDNA)    689.378    738.184    825.3014    766.862    793.134   6027.039   100
  fbio(longDNA) 118371.825 125552.401 129543.6585 127245.489 129165.711 359335.294   100
127245.489/766.862
## [1] 165.9301

About 165 times faster :)
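
The overlap=TRUE argument in the code above matters because k-mer counts come from a window that slides one base at a time. A small base-R sketch of the difference, using the first sequence from the question (the regular-expression approach here is only for illustration and is not what stringi does internally; `count_hits` is a hypothetical helper):

```r
x <- "CAAACTGATTTT"

# gregexpr() returns -1 when there is no match, so count positive positions.
count_hits <- function(m) sum(m[[1]] > 0)

# Non-overlapping scan: after a match, the search resumes past its end.
non_overlapping <- count_hits(gregexpr("TT", x, fixed = TRUE))

# Overlapping scan: a zero-width lookahead can match at every start position.
overlapping <- count_hits(gregexpr("(?=TT)", x, perl = TRUE))

c(non_overlapping = non_overlapping, overlapping = overlapping)
```

The overlapping count is the one that agrees with the TT column of row 1 in the result tables above.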

Answer 3

My answer isn't as fast as @bartektartanus's. However, it is still pretty fast, and I wrote the code myself... :D

The advantages of my code compared to the others are:

There is no need to install the unreleased version of stri_count_fixed.
The stringi approach will probably get really slow for big k-mers, since it has to generate all possible combinations for the pattern and then check each one against the data and count how many times it appears.
It also works really fast for a single long sequence and for multiple sequences, with the same output format.
You can pass a value for k instead of creating a pattern string.
If you run oligonucleotideFrequency with k bigger than 12 on a big sequence, the function freezes from excessive memory use and R is restarted, while my function runs pretty fast.
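
As a rough back-of-the-envelope illustration of the memory point above: a dense result needs one counter per possible k-mer, so its size grows as 4^k regardless of sequence length. The figures below assume 4 bytes per integer count and a single dense vector; they are an estimate under those assumptions, not a measurement of what Biostrings allocates, and they do not by themselves account for the memory use reported in this answer.

```r
k <- 10:13
n_columns <- 4^k                    # number of possible DNA k-mers
dense_mib <- n_columns * 4 / 2^20   # assumption: 4 bytes per integer count
data.frame(k, n_columns, dense_mib)

# For comparison, a 1000 bp sequence contains at most 1000 - k + 1 distinct k-mers.
max_observed <- 1000 - k + 1
max_observed
```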

My code:

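library(Biostrings)  # used only for length(DNAString(x))

sequence_kmers <- function(sequence, k){
    # list with one character vector of overlapping k-mers per sequence
    k_mers <- lapply(sequence,function(x){
        seq_loop_size <- length(DNAString(x))-k+1

        # seq_len() yields no positions when the sequence is shorter than k
        kmers <- vapply(seq_len(max(seq_loop_size, 0)), function(z){
            y <- z + k -1
            substr(x=x, start=z, stop=y)
        }, character(1))
        return(kmers)
    })

    # columns are the k-mers actually observed, in order of first appearance
    uniq <- unique(unlist(k_mers))
    ind <- t(sapply(k_mers, function(x){
        tabulate(match(x, uniq), length(uniq))
    }))
    colnames(ind) <- uniq

    return(ind)
}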

I use the Biostrings package only to count the bases; you can use other options, such as stringi, for that. If you remove all the code below the k_mers lapply and return(k_mers) instead, the function returns just the list of all k-mers for each sequence, with repeats included.

`sequence` here is a sequence of 1000 bp.

#same output for 1 or multiple sequences
> sequence_kmers(sequence,4)[,1:10]
GTCT TCTG CTGA TGAA GAAC AACG ACGC CGCG GCGA CGAG 
   4    4    3    4    4    8    6    4    5    5 
> sequence_kmers(c(sequence,sequence),4)[,1:10]
     GTCT TCTG CTGA TGAA GAAC AACG ACGC CGCG GCGA CGAG
[1,]    4    4    3    4    4    8    6    4    5    5
[2,]    4    4    3    4    4    8    6    4    5    5

Tests done with my function:

#super fast for 1 sequence
> system.time({sequence_kmers(sequence,13)})
     user    system   elapsed 
     0.08      0.00      0.08 

#works fast for 1 sequence or 50 sequences of 1000bps
> system.time({sequence_kmers(rep(sequence,50),4)})
     user    system   elapsed
     3.61      0.00      3.61 

#same speed for 4-mers or 13-mers
> system.time({sequence_kmers(rep(sequence,50),13)})
     user    system   elapsed
     3.63      0.00      3.62

Tests done with Biostrings:

#Slow 1 sequence 12-mers
> system.time({oligonucleotideFrequency(DNAString(sequence),12)})
     user    system   elapsed 
   150.11      1.14    151.37 

#Biostrings package freezes for a single sequence of 13-mers
> system.time({oligonucleotideFrequency(DNAString(sequence),13)})  
freezes, used all my 8 GB of RAM

Answer 4

We recently released our 'kebabs' package as part of the Bioconductor 3.0 release. Though this package is aimed at providing sequence kernels for classification, regression, and other tasks such as similarity-based clustering, it also includes functionality for computing k-mer frequencies efficiently:

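#installing kebabs (biocLite() was current when this was written;
#newer Bioconductor releases use BiocManager::install() instead):
#source("http://bioconductor.org/biocLite.R")
#biocLite(c("kebabs", "Biostrings"))
library(kebabs)

s1 <- DNAString("ATCGATCGATCGATCGATCGATCGACTGACTAGCTAGCTACGATCGACTG")
s1
#repeat s1 200 times; note that paste() takes collapse=, not collate=
s2 <- DNAString(paste(rep(as.character(s1), 200), collapse=""))
s2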

sk13 <- spectrumKernel(k=13, normalized=FALSE)
system.time(kmerFreq <- drop(getExRep(s1, sk13)))
kmerFreq
system.time(kmerFreq <- drop(getExRep(s2, sk13)))
kmerFreq

As you can see, the k-mer frequencies are obtained as the explicit feature vector of the standard (unnormalized) spectrum kernel with k=13. This function is implemented in highly efficient C++ code that builds up a prefix tree and only considers k-mers that actually occur in the sequence (as you requested). Even for k=13 and a sequence with tens of thousands of bases, the computation takes only a fraction of a second (19 msecs on our 5-year-old Dell server). The above function also works for DNAStringSets, but in that case you should remove the drop() to get a matrix of k-mer frequencies. The matrix is sparse by default (class 'dgRMatrix'), but you can also force the result into standard dense matrix format (still omitting k-mers that do not occur in any of the sequences):

sv <- c(DNAStringSet(s1), DNAStringSet(s2))
system.time(kmerFreq <- getExRep(sv, sk13))
kmerFreq
system.time(kmerFreq <- getExRep(sv, sk13, sparse=FALSE))
kmerFreq

How long the k-mers can be may depend on your system. On our system, the limit seems to be k=22 for DNA sequences. The same works for RNA and amino acid sequences. For the latter, however, the limits in terms of k are significantly lower, since the feature space is obviously much larger for the same k.

#for the kebabs documentation please see:
browseVignettes("kebabs")

I hope that helps. If you have any further questions, please let me know.

Best regards, Ulrich

Answer 5

Another way to do this:

library(stringi)
DNAlst<-list("CAAACTGATTTT","GATGAAAGTAAAATACCG","ATTATGC","TGGA","CGCGCATCAA","ACACACACACCA")
len <- 4
# tabulate every substring of length `len`, one per start position
stri_sub_fun <- function(x) table(stri_sub(x,1:(stri_length(x)-len+1),length = len))
sapply(DNAlst, stri_sub_fun)
[[1]]

AAAC AACT ACTG ATTT CAAA CTGA GATT TGAT TTTT 
   1    1    1    1    1    1    1    1    1 

[[2]]

AAAA AAAG AAAT AAGT AATA ACCG AGTA ATAC ATGA GAAA GATG GTAA TAAA TACC TGAA 
   1    1    1    1    1    1    1    1    1    1    1    1    1    1    1 

[[3]]

ATGC ATTA TATG TTAT 
   1    1    1    1 

[[4]]

TGGA 
   1 

[[5]]

ATCA CATC CGCA CGCG GCAT GCGC TCAA 
   1    1    1    1    1    1    1 

[[6]]

ACAC ACCA CACA CACC 
   4    1    3    1
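
The output above is a list with one table per sequence, whereas the question asked for a single table with one row per sequence. A minimal base-R sketch of that last step follows; `kmer_tables` is a small hand-made stand-in for the list shown above, and the column set is the union of observed k-mers rather than all 4^k combinations.

```r
# Stand-in for the per-sequence tables (hypothetical example data).
kmer_tables <- list(
  table(c("ACAC", "CACA", "ACAC")),
  table(c("TGGA"))
)

# The union of all observed k-mers becomes the column set.
all_kmers <- sort(unique(unlist(lapply(kmer_tables, names))))

# Build one zero-filled named row per sequence, then stack the rows.
counts <- do.call(rbind, lapply(kmer_tables, function(tab) {
  row <- setNames(integer(length(all_kmers)), all_kmers)
  row[names(tab)] <- as.integer(tab)
  row
}))
counts
```

Stacking with rbind keeps the result a matrix even when only one k-mer is observed overall.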

5 answers

Answer 1
6 2014-10-28 04:24:07

Answer 2
6 accepted 2014-10-28 15:45:53

Answer 3
4 2014-11-27 21:15:14

Answer 4
4 2014-11-28 11:09:50

Answer 5
2
