
Making a list of all mutations of a sequence (DNA)

I have a DNA sequence, and I want to find all instances of it, or of any of its possible mutations, in a list of DNA sequence reads. I am using grepl to do this, since it is faster than matchPattern in my use case. I use parLapply to feed my vector of mutations to the grepl function. What I am interested in is an easy way of auto-generating my vector of sequence mutations. Originally I typed each mutation by hand, but that leaves room for human error, and if the sequence is lengthened, more mutations need to be typed. In addition, my current code only allows 1 mutation, and some sequences should allow for more mutations than others. I am not looking for someone to write a loop for me, just a suggestion for an approach that works for any string.

Right now, I have a semi-automated way of generating the mutations. It generates the vector without me typing them all out, but it only works for sequences 8 nucleotides long. There has to be a better way to generate the vector for a nucleotide sequence of any length.

This is my code:

#My sequence of interest
seq1 <- "GGCGACTG"
lenseq1 <- nchar(seq1)

#A vector of the length of the sequence I wish to create all mutations of
mutsinseq1 <- rep(seq1, 5*lenseq1+4*(lenseq1-1)+1)

#The possible substitutions, insertions, and deletions to the sequence of interest
possnuc <- c("A","T","C","G","")
lenpossnuc <- length(possnuc)

#changing all elements of the vector except for the first
#the first 8 if statements are nucleotide substitutions or deletions
#the other if statements allow for inserts between nucleotides
for(i in 2:length(mutsinseq1)){
  if(i<7){
    mutsinseq1[i] <- paste(possnuc[i-1],substr(seq1,2,lenseq1),sep = "") 
  } else if(i<12){
    mutsinseq1[i] <- paste(substr(seq1,1,1),possnuc[i-6],substr(seq1,3,lenseq1),sep = "")
  } else if(i<17){
    mutsinseq1[i] <- paste(substr(seq1,1,2),possnuc[i-11],substr(seq1,4,lenseq1),sep = "")
  } else if(i<22){
    mutsinseq1[i] <- paste(substr(seq1,1,3),possnuc[i-16],substr(seq1,5,lenseq1),sep = "")
  } else if(i<27){
    mutsinseq1[i] <- paste(substr(seq1,1,4),possnuc[i-21],substr(seq1,6,lenseq1),sep = "")
  } else if(i<32){
    mutsinseq1[i] <- paste(substr(seq1,1,5),possnuc[i-26],substr(seq1,7,lenseq1),sep = "")
  } else if(i<37){
    mutsinseq1[i] <- paste(substr(seq1,1,6),possnuc[i-31],substr(seq1,8,lenseq1),sep = "")
  } else if(i<42){
    mutsinseq1[i] <- paste(substr(seq1,1,7),possnuc[i-36],sep = "")
  } else if(i<46){
    mutsinseq1[i] <- paste(substr(seq1,1,1),possnuc[i-41],substr(seq1,2,lenseq1),sep = "")
  } else if(i<50){
    mutsinseq1[i] <- paste(substr(seq1,1,2),possnuc[i-45],substr(seq1,3,lenseq1),sep = "")
  } else if(i<54){
    mutsinseq1[i] <- paste(substr(seq1,1,3),possnuc[i-49],substr(seq1,4,lenseq1),sep = "")
  } else if(i<58){
    mutsinseq1[i] <- paste(substr(seq1,1,4),possnuc[i-53],substr(seq1,5,lenseq1),sep = "")
  } else if(i<62){
    mutsinseq1[i] <- paste(substr(seq1,1,5),possnuc[i-57],substr(seq1,6,lenseq1),sep = "")
  } else if(i<66){
    mutsinseq1[i] <- paste(substr(seq1,1,6),possnuc[i-61],substr(seq1,7,lenseq1),sep = "")
  } else{
    mutsinseq1[i] <- paste(substr(seq1,1,7),possnuc[i-65],substr(seq1,8,lenseq1),sep = "")
  }
}

#getting rid of duplicate mutations
mutsinseq1 <- mutsinseq1[!duplicated(mutsinseq1)]

The following is what I wish to produce (and is what my current code produces):

mutsinseq1
[1] "GGCGACTG"  "AGCGACTG"  "TGCGACTG"  "CGCGACTG"  "GCGACTG"   "GACGACTG"  "GTCGACTG"  "GCCGACTG"  "GGAGACTG"  "GGTGACTG"  "GGGGACTG"  "GGGACTG"   "GGCAACTG" 
[14] "GGCTACTG"  "GGCCACTG"  "GGCACTG"   "GGCGTCTG"  "GGCGCCTG"  "GGCGGCTG"  "GGCGCTG"   "GGCGAATG"  "GGCGATTG"  "GGCGAGTG"  "GGCGATG"   "GGCGACAG"  "GGCGACCG" 
[27] "GGCGACGG"  "GGCGACG"   "GGCGACTA"  "GGCGACTT"  "GGCGACTC"  "GGCGACT"   "GAGCGACTG" "GTGCGACTG" "GCGCGACTG" "GGGCGACTG" "GGACGACTG" "GGTCGACTG" "GGCCGACTG"
[40] "GGCAGACTG" "GGCTGACTG" "GGCGGACTG" "GGCGAACTG" "GGCGTACTG" "GGCGCACTG" "GGCGATCTG" "GGCGACCTG" "GGCGAGCTG" "GGCGACATG" "GGCGACTTG" "GGCGACGTG" "GGCGACTAG"
[53] "GGCGACTCG" "GGCGACTGG"

How do I solve the problem?

In other languages, you might do this with a series of nested loops, but R has some nice combinatorics functions. Here's the overall function to do what you want:

library(stringr)
library(purrr)
library(dplyr)

mutate_sequence <- function(string, num = 1, nucleotides = c("A","T","C","G","_")) {
  l_str <- str_length(string)

  choices <- cross(list(
    cols = combn(seq_len(l_str), num, simplify = F),
    muts = cross(rerun(num, nucleotides)) %>% map(unlist)
  ))

  choice_matrix <- 
    map_dfr(choices, as_tibble, .id = "rows") %>% 
    mutate(rows = as.numeric(rows))

  seq_matrix <- str_split(rep(string, max(choice_matrix$rows)), "", simplify = T)

  seq_matrix[as.matrix(choice_matrix[,1:2])] <- str_to_lower(choice_matrix$muts)
  apply(seq_matrix, 1, paste, collapse = "")
}

I used some packages to make things a little easier for myself, but it could all be translated into base R.
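As a rough illustration of that base R translation, here is a minimal sketch. The name mutate_sequence_base is hypothetical (it is not part of the answer's code), and it assumes the same argument defaults and the same output ordering as the function above: positions vary fastest, then replacement characters.

```r
# Hypothetical base R counterpart of mutate_sequence(); the name is illustrative.
mutate_sequence_base <- function(string, num = 1,
                                 nucleotides = c("A", "T", "C", "G", "_")) {
  chars <- strsplit(string, "")[[1]]

  # Every set of `num` positions to mutate, one set per column.
  pos <- combn(length(chars), num)

  # Every ordered choice of `num` replacement characters, one choice per row.
  muts <- as.matrix(expand.grid(rep(list(nucleotides), num),
                                stringsAsFactors = FALSE))

  # Apply each replacement choice at each set of positions.
  variants <- lapply(seq_len(nrow(muts)), function(m) {
    apply(pos, 2, function(p) {
      variant <- chars
      variant[p] <- tolower(muts[m, ])
      paste(variant, collapse = "")
    })
  })
  unlist(variants)
}

res <- mutate_sequence_base("ATCG", num = 2)
length(res)   # 150, i.e. choose(4, 2) * 5^2
head(res, 3)  # "aaCG" "aTaG" "aTCa"
```

The nested lapply/apply keeps the sketch short; for long sequences or large num, the matrix-indexing approach of the original function is likely to be faster.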

Here's example output:

mutate_sequence("ATCG", num = 2)
  [1] "aaCG" "aTaG" "aTCa" "AaaG" "AaCa" "ATaa" "taCG" "tTaG" "tTCa" "AtaG" "AtCa" "ATta" "caCG" "cTaG"
 [15] "cTCa" "AcaG" "AcCa" "ATca" "gaCG" "gTaG" "gTCa" "AgaG" "AgCa" "ATga" "_aCG" "_TaG" "_TCa" "A_aG"
 [29] "A_Ca" "AT_a" "atCG" "aTtG" "aTCt" "AatG" "AaCt" "ATat" "ttCG" "tTtG" "tTCt" "AttG" "AtCt" "ATtt"
 [43] "ctCG" "cTtG" "cTCt" "ActG" "AcCt" "ATct" "gtCG" "gTtG" "gTCt" "AgtG" "AgCt" "ATgt" "_tCG" "_TtG"
 [57] "_TCt" "A_tG" "A_Ct" "AT_t" "acCG" "aTcG" "aTCc" "AacG" "AaCc" "ATac" "tcCG" "tTcG" "tTCc" "AtcG"
 [71] "AtCc" "ATtc" "ccCG" "cTcG" "cTCc" "AccG" "AcCc" "ATcc" "gcCG" "gTcG" "gTCc" "AgcG" "AgCc" "ATgc"
 [85] "_cCG" "_TcG" "_TCc" "A_cG" "A_Cc" "AT_c" "agCG" "aTgG" "aTCg" "AagG" "AaCg" "ATag" "tgCG" "tTgG"
 [99] "tTCg" "AtgG" "AtCg" "ATtg" "cgCG" "cTgG" "cTCg" "AcgG" "AcCg" "ATcg" "ggCG" "gTgG" "gTCg" "AggG"
[113] "AgCg" "ATgg" "_gCG" "_TgG" "_TCg" "A_gG" "A_Cg" "AT_g" "a_CG" "aT_G" "aTC_" "Aa_G" "AaC_" "ATa_"
[127] "t_CG" "tT_G" "tTC_" "At_G" "AtC_" "ATt_" "c_CG" "cT_G" "cTC_" "Ac_G" "AcC_" "ATc_" "g_CG" "gT_G"
[141] "gTC_" "Ag_G" "AgC_" "ATg_" "__CG" "_T_G" "_TC_" "A__G" "A_C_" "AT__"

I made the mutations lowercase or "_" to make it obvious where they are, but you can easily change that to get them back to "normal" sequences.

Here is what each line does:

l_str <- str_length(string)

Gets the number of characters in the string.

combn(seq_len(l_str), num, simplify = F)

1) This gives all possible combinations of positions (indexes) along the sequence, taken num at a time, where num is the number of mutations.
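To make the position step concrete, here is a small sketch for a 4-character string with num = 2. The variable name positions is only illustrative.

```r
# All ways to pick 2 of the 4 positions, returned as a list of integer vectors.
positions <- combn(seq_len(4), 2, simplify = FALSE)

length(positions)  # 6, i.e. choose(4, 2)
positions[[1]]     # 1 2
positions[[2]]     # 1 3
positions[[6]]     # 3 4
```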

rerun(num, nucleotides)

2) This repeats your vector of nucleotides num times and makes it a list. cross(rerun(num, nucleotides)) then gives you every combination from that list as a list, so you're taking every possible combination of nucleotides, with repeats. cross(rerun(num, nucleotides)) %>% map(unlist) collapses the deepest level of the list into vectors.
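As far as I know, rerun() and cross() are deprecated in purrr 1.0.0 and later, so the call may emit warnings there. A possible base R stand-in, sketched under the assumption that only the list of replacement vectors is needed, uses rep() and expand.grid(). The names grid and muts below are illustrative.

```r
nucleotides <- c("A", "T", "C", "G", "_")
num <- 2

# One row per ordered choice of `num` replacement characters;
# the first column varies fastest, as with cross().
grid <- expand.grid(rep(list(nucleotides), num), stringsAsFactors = FALSE)

# Convert each row to a plain character vector, mirroring map(unlist).
muts <- lapply(seq_len(nrow(grid)), function(i) unlist(grid[i, ], use.names = FALSE))

length(muts)  # 25, i.e. 5^2
muts[[1]]     # "A" "A"
muts[[2]]     # "T" "A"
```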

So the combn() and cross() calls give you every possible choice of positions and every possible combination of replacements. Then we need every possible pairing of the two!

  choices <- cross(list(
    cols = combn(seq_len(l_str), num, simplify = F),
    muts = cross(rerun(num, nucleotides)) %>% map(unlist)
  ))

For the above output, that means:

[[1]]
[[1]]$`cols`
[1] 1 2

[[1]]$muts
[1] "A" "A"


[[2]]
[[2]]$`cols`
[1] 1 2

[[2]]$muts
[1] "T" "A"

...

So first, for positions 1/2, it gives us A/A, T/A, C/A, G/A, _/A, etc. Then each combination again for positions 1/3, then positions 1/4, then 2/3, then 2/4, then 3/4.

So now you have this long list; let's make it into something nicer. First we make each element into a data frame with cols and muts, then bind them all into a single one, with an identifier for each element called rows:

map_dfr(choices, as_tibble, .id = "rows")
# A tibble: 50 x 3
   rows   cols muts
   <chr> <int> <chr>
 1 1         1 A
 2 1         2 A
 3 2         1 T
 4 2         2 A
 5 3         1 C
 6 3         2 A
 7 4         1 G
 8 4         2 A
 9 5         1 _
10 5         2 A
# ... with 40 more rows

This gives us a long data frame. Each value of rows corresponds to one output string, and cols tells us which positions in the string will be replaced. muts holds the characters that will go in those positions. In order to do the subsetting later, we then convert rows to numeric using mutate(...).

seq_matrix <- str_split(rep(string, max(choice_matrix$rows)), "", simplify = T)

Now we take your original string and repeat it as many times as the choice_matrix tells us we'll have mutated sequences. Then we take that vector and split every element along the character boundaries:

     [,1] [,2] [,3] [,4]
[1,] "A"  "T"  "C"  "G"
[2,] "A"  "T"  "C"  "G"
[3,] "A"  "T"  "C"  "G"
[4,] "A"  "T"  "C"  "G"
[5,] "A"  "T"  "C"  "G"
[6,] "A"  "T"  "C"  "G"
...

Now we have a big matrix, and R is fast at operating on big matrices. We could have done all the other steps with matrix operations, but that seemed like more work than using these list-combination functions.

seq_matrix[as.matrix(choice_matrix[,1:2])] <- str_to_lower(choice_matrix$muts)

This identifies each position based on the rows and cols in the choice_matrix, then puts the appropriate value from muts in it. This is also where you can take out str_to_lower to keep the mutations from being lowercase. To turn the "_" into a real deletion, change the default argument of nucleotides so that "_" becomes "".

     [,1] [,2] [,3] [,4]
[1,] "a"  "a"  "C"  "G"
[2,] "a"  "T"  "a"  "G"
[3,] "a"  "T"  "C"  "a"
[4,] "A"  "a"  "a"  "G"
[5,] "A"  "a"  "C"  "a"
[6,] "A"  "T"  "a"  "a"
...

So row 1 got "A" and "A" in positions 1 and 2. Then row 2 got "A" and "A" in positions 1 and 3, etc. Now we just have to apply a function across each row (that's what the 1 in apply(..., 1, ...) does) to combine each row into a single string. That would be paste(..., collapse = "").
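The assignment relies on a base R feature: a two-column matrix used as an index selects one (row, column) cell per row of the index. A minimal sketch with a 2-row toy matrix (the names idx and seq_matrix here are just for illustration):

```r
# Two copies of "ATCG", one character per cell.
seq_matrix <- matrix(rep(c("A", "T", "C", "G"), each = 2), nrow = 2)

# Each row of idx names one cell: (row, column).
idx <- cbind(rows = c(1, 1, 2, 2), cols = c(1, 2, 1, 3))

# Overwrite those four cells in one vectorised assignment.
seq_matrix[idx] <- tolower(c("A", "A", "A", "A"))

# Collapse each row back into a string.
apply(seq_matrix, 1, paste, collapse = "")  # "aaCG" "aTaG"
```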

The output gets huge quickly. If you do 3 mutations on your original 8-nucleotide sequence, you get 7000 output strings; 4 mutations give 43750. Each step is also that much slower, taking about 5 s to run the 4 mutations on my desktop. You can precalculate the output length, which is choose(l_str, num) * length(nucleotides)^num.
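As a quick check of that formula, here is a small helper. The name n_variants and its n_choices argument are hypothetical; n_choices stands for length(nucleotides), which is 5 with the default vector.

```r
# Number of strings the first version of mutate_sequence() returns
# (before any removal of duplicates).
n_variants <- function(l_str, num, n_choices = 5) {
  choose(l_str, num) * n_choices^num
}

n_variants(8, 3)  # 7000
n_variants(8, 4)  # 43750
n_variants(4, 2)  # 150, matching the "ATCG" example
```

Checking the size before generating the variants can help you decide whether a given num is practical for your sequence length.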


Updated, again:

To handle insertions as well as deletions, we just need the character matrix to have a slot for every possible insertion. Here's that version:

mutate_sequence <- function(string, num = 1, nucleotides = c("A","T","C","G","")) {
  if (num < 1) {return(string)}

  l_str <- str_length(string)
  l_pos <- (num + 1)*(l_str - 1) + 1

  choices <- cross(list(
    cols = combn(seq_len(l_pos), num, simplify = F),
    muts = cross(rerun(num, nucleotides)) %>% map(unlist)
  ))

  choice_matrix <- 
    map_dfr(choices, as_tibble, .id = "rows") %>% 
    mutate(rows = as.numeric(rows))

  blanks <- character(l_pos)
  orig_pos <- (seq_len(l_str) - 1) * (num+1) + 1
  blanks[orig_pos] <- str_split(string, "", simplify = T)

  seq_matrix <- matrix(
    rep(blanks, max(choice_matrix$rows)), 
    ncol = l_pos, byrow = T
    )

  seq_matrix[as.matrix(choice_matrix[,1:2])] <- str_to_lower(choice_matrix$muts)
  sequences <- apply(seq_matrix, 1, paste, collapse = "")
  sequences[!duplicated(str_to_upper(sequences))]
}

This does essentially the same as the version of the function above, but first you create a blank vector with enough spots for every insertion. After each original nucleotide except the last one, you need num additional spots for possible insertions. That works out to l_pos <- (num + 1)*(l_str - 1) + 1 positions. character(l_pos) gives you the blanks, and then you fill in the blanks with the original nucleotides at (seq_len(l_str) - 1) * (num+1) + 1.

For example, ATCG with two mutations becomes "A" "" "" "T" "" "" "C" "" "" "G". The rest of the function works the same, just putting every possible nucleotide (or deletion) in every possible spot.
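The scaffold can be built and inspected on its own. This sketch reuses the formulas from the function in base R; the variables deleted and inserted are illustrative and only show what writing into a slot does.

```r
string <- "ATCG"
num <- 2

l_str <- nchar(string)
l_pos <- (num + 1) * (l_str - 1) + 1              # 10 slots
blanks <- character(l_pos)                        # all ""
orig_pos <- (seq_len(l_str) - 1) * (num + 1) + 1  # 1 4 7 10
blanks[orig_pos] <- strsplit(string, "")[[1]]

# Writing "" over an original slot deletes that nucleotide.
deleted <- blanks
deleted[4] <- ""
paste(deleted, collapse = "")   # "ACG"

# Writing a base into an empty slot inserts it.
inserted <- blanks
inserted[2] <- "g"
paste(inserted, collapse = "")  # "AgTCG"
```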

The output before pasting it all back together then looks like:

     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] "a"  "a"  ""   "T"  ""   ""   "C"  ""   ""   "G"  
[2,] "a"  ""   "a"  "T"  ""   ""   "C"  ""   ""   "G"  
[3,] "a"  ""   ""   "a"  ""   ""   "C"  ""   ""   "G"  
[4,] "a"  ""   ""   "T"  "a"  ""   "C"  ""   ""   "G"  
[5,] "a"  ""   ""   "T"  ""   "a"  "C"  ""   ""   "G" 
...  

Then, after pasting each row, we can check for repeats with duplicated and exclude those. We could also just get rid of the lowercase mutations and use unique(sequences) instead. Now the output is much shorter than before; here are the first 55 of 278:

 [1] "aaTCG"  "aaCG"   "aTaCG"  "aTaG"   "aTCaG"  "aTCa"   "AaaTCG" "AaaCG"  "AaTaCG" "AaTaG"  "AaTCaG"
[12] "AaTCa"  "AaaG"   "AaCaG"  "AaCa"   "ATaaCG" "ATaaG"  "ATaCaG" "ATaCa"  "ATaa"   "ATCaaG" "ATCaa"
[23] "taTCG"  "taCG"   "tTaCG"  "tTaG"   "tTCaG"  "tTCa"   "AtaTCG" "AtTaCG" "AtTaG"  "AtTCaG" "AtTCa"
[34] "ATta"   "ATCtaG" "ATCta"  "caTCG"  "caCG"   "cTaCG"  "cTaG"   "cTCaG"  "cTCa"   "AcaTCG" "AcaCG"
[45] "AcTaCG" "AcTaG"  "AcTCaG" "AcTCa"  "AcaG"   "AcCaG"  "AcCa"   "ATcaCG" "ATcCaG" "ATcCa"  "gaTCG"
...
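The two ways of removing repeats mentioned above behave slightly differently, which this toy example tries to show (the input vector is made up for illustration): filtering on duplicated() of the upper-cased strings keeps the first spelling with its lowercase markers, while unique() on upper-cased strings returns plain sequences.

```r
sequences <- c("aaCG", "AaCG", "aTCG", "AACG")

# Keep the first occurrence of each sequence, ignoring case,
# so the lowercase mutation markers survive.
sequences[!duplicated(toupper(sequences))]  # "aaCG" "aTCG"

# Drop the markers entirely and keep distinct sequences.
unique(toupper(sequences))                  # "AACG" "ATCG"
```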

EDITED: Entirely revised for a third time to better address the question! Incidentally, the key solution here (in the form of three helper functions) does not require the Biostrings package.

As I understand the problem, a short DNA query sequence is to be matched against a large number of reference DNA sequences. The twist is that an arbitrary number of variations of the DNA query sequence, in the form of insertions or deletions, are to be searched for in the reference DNA sequences.

The function vmatchPattern() from the Biostrings package can identify matches of a given pattern with an arbitrary number of mismatches in a set of reference sequences. In addition, vmatchPattern() can identify matches of a given pattern with possible insertions or deletions (indels). However, unlike matchPattern(), vmatchPattern() cannot do both at the same time.

The solution sought here is to generate variations of the DNA query sequence that can then be passed to a search function such as grepl() or, as suggested here, vmatchPattern().

The revised solution posted here includes three functions. makeDel() will generate all possible variants of a short sequence with an arbitrary number of deletions. The companion function makeIns() will generate variants of the short sequence with insertions of the bases specified by the IUPAC code in symbol. makeSub() will make the desired substitutions using the bases specified by the IUPAC code in symbol. This approach, generating all possible combinations of other bases, allows the character strings to be used directly in pattern-matching functions including vmatchPattern().

The following ensures that the Biostrings package is available, if it is going to be used. This code applies to R 3.6.0 and later.

  if (!require("Biostrings")) {
    if (!requireNamespace("BiocManager", quietly = TRUE))
        install.packages("BiocManager")
    BiocManager::install("Biostrings")
  }
  library(Biostrings)

Now some test data. The original query sequence "GGCGACTG" will be used as the "query", and 1000 random sequences of between 200 and 400 nucleotides will be used as the reference set.

  seq1 <- "GGCGACTG"

  set.seed(1234)
  ref <- replicate(1e3,
    sample(c("A", "C", "G", "T"), sample(200:400, 1), replace = TRUE),
    simplify = FALSE)
  ref <- sapply(ref, paste, collapse = "")
  ref <- DNAStringSet(ref)

Before proceeding with the solution, let's peek at what can be found with the pattern alone.

# how many times does the pattern occur?
  table(vcountPattern(seq1, ref)) # just 3 times
>   0   1 
> 997   3 

# how many times allowing for one mismatch?
# once in 96 sequences and twice in three sequences
  n <- vcountPattern(seq1, ref, max.mismatch = 1)
  table(n)
> n
>   0   1   2 
> 901  96   3 

# examine the matched sequences
  m <- vmatchPattern(seq1, ref, max.mismatch = 1) # find the patterns
  sel <- which.max(n) # select the first one with 2 matches
  Views(ref[[sel]], m[[sel]]) # examine the matches
>   Views on a 384-letter DNAString subject
> subject: TCGCGTCGCACTTCTGCTAACACAGC...GCCCAGTCGACTGCTGCTCGGATTGC
> views:
>     start end width
> [1]   104 111     8 [GGCGACCG]
> [2]   364 371     8 [GTCGACTG] 

Here are the three helper functions to generate the variants. The argument seq can be a character string such as "GGGCGACTG" or a DNAString object. The argument n is an integer that specifies the upper limit on deletions, insertions, or substitutions. These functions will create variants with 1, ..., n deletions, insertions or substitutions. If n is set to 0, the function will return the original sequence. The argument symbol for makeIns() and makeSub() should be a single IUPAC character to specify which bases will be inserted or substituted. The default value of "N" specifies all possible bases ("A", "C", "G" and "T").
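To illustrate how a single IUPAC symbol maps to a set of bases, here is a small sketch using a shortened lookup table. Only three of the codes are shown, and the helper name expand_symbol is hypothetical; the functions below carry their own full table.

```r
# Abbreviated IUPAC lookup: N = any base, Y = pyrimidine, R = purine.
iupac <- c(N = "ACGT", Y = "CT", R = "AG")

# Split the code's expansion into individual bases.
expand_symbol <- function(symbol) strsplit(iupac[[toupper(symbol)]], "")[[1]]

expand_symbol("N")  # "A" "C" "G" "T"
expand_symbol("y")  # "C" "T"
```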

makeDel() uses combn() to identify the deletion positions. The logic for makeIns() and makeSub() is a bit more complex because of the need to allow insertions to be adjacent to each other and the need to create all combinations. Here I chose not to add insertions at the beginning or end of the query sequence.

All functions return a character vector suitable for use in vmatchPattern() or grep().

To create deletions in a DNA string:

  ##
  ## makeDel - create 1:n deletions in a character string (DNA sequence)
  ##  return a character vector of all possible variants
  ##
  makeDel <- function(seq, n) {
  # accept only a single value for 'seq'
    cseq <- as.character(seq)
    cseq <- unlist(strsplit(cseq[1], ""))
    nseq <- length(cseq)

  # simple argument checks
    if (!is(n, "numeric")) stop("'n' must be an integer")
    if (n == 0) return(paste(cseq, collapse = ""))
    if (n >= nseq) stop("too many deletions for ", nseq, " letters")

  # create all possible combinations to be dropped in 'index'
    index <- lapply(seq_len(n), function(j) combn(nseq, j, simplify = FALSE))
    index <- unlist(index, recursive = FALSE)

  # drop base in each possible position and reassemble
    ans <- lapply(index, function(idx) cseq[-idx])
    ans <- sapply(ans, paste, collapse = "")
    ans <- unique(ans) # remove duplicates
    return(ans)
  }

To create insertions in a DNA string:

  ##
  ## makeIns - create 1:n insertions into DNA string (character vector)
  ##   where each insertion is one of a given IUPAC-specified symbol
  ##   return a character vector of all possible variants
  ##
  makeIns <- function(seq, n, symbol = "N") {
  # IUPAC codes for ambiguous bases
    iupac <- c(N = "ACGT", A = "A", C = "C", G = "G", T = "T", M = "AC", R = "AG",
      W = "AT", S = "CG", Y = "CT", K = "GT", V = "ACG", H = "ACT",
      D = "AGT", B = "CGT")

 # only accept single value for 'seq'
    cseq <- as.character(seq)
    cseq <- unlist(strsplit(cseq[1], ""))
    nseq <- length(cseq)

 # simple argument checks
    if (!is(n, "numeric")) stop("'n' must be an integer")
    symbol <- toupper(symbol)
    if (nchar(symbol) != 1 | !symbol %in% names(iupac))
      stop("'symbol' must be a single valid IUPAC symbol")
    if (n == 0) return(paste(cseq, collapse = ""))
    if (n >= nseq) stop("seems like too many insertions for ", nseq, " letters")

  # which bases are to be inserted?
    ACGT <- strsplit(iupac[symbol], "")[[1]]

  # create all possible combinations of positions for the insertion 
    ipos <- seq_len(nseq - 1) # insert after this position
    index <- lapply(1:n, function(i) do.call(expand.grid, rep(list(ipos), i)))
    index <- lapply(index, function(v) split(v, seq_len(nrow(v))))
    index <- unlist(index, recursive = FALSE)
    index <- lapply(index, unlist)
    index <- lapply(index, sort)

  # place the required number of insertions after each position in index
    res <- lapply(index, function(idx) {
      tally <- rle(idx)$lengths
      breaks <- diff(c(0, idx, nseq))
      NN <- Map(base::rep, symbol, tally)
      spl <- split(cseq, rep(seq_along(breaks), breaks))
      sel <- head(seq_along(spl), -1)
      spl[sel] <- Map(base::c, spl[sel], NN)
      ans <- unlist(spl)
      if (length(ACGT) > 1) { # replicate and replace with appropriate bases
        sites <- grep(symbol, ans)
        nsites <- length(sites)
        nsymbol <- length(ACGT)

        bases <- expand.grid(rep(list(ACGT), nsites), stringsAsFactors = FALSE)
        bases <- as.matrix(bases)
        nvars <- nrow(bases)

        ans <- do.call(rbind, rep(list(ans), nvars))
        ans[, sites] <- bases
        ans <- split(ans, seq_len(nvars))
        ans <- lapply(ans, paste, collapse = "")
      }
      else
        ans <- paste(ans, collapse = "")
      return(ans)
    })
    res <- unlist(res)
    res <- unique(res)
    return(res)
  }

To create substitutions in a DNA string:

  ##
  ## makeSub - create an arbitrary number of substitutions in each 1:n positions
  ##   with the IUPAC bases specified by 'symbol'
  ##   return a character vector with all possible variants
  ##
  makeSub <- function(seq, n, symbol = "N")
  {
  # IUPAC codes for ambiguous bases
    iupac <- c(N = "ACGT", A = "A", C = "C", G = "G", T = "T", M = "AC", R = "AG",
      W = "AT", S = "CG", Y = "CT", K = "GT", V = "ACG", H = "ACT",
      D = "AGT", B = "CGT")

  # accept only a single value for 'seq'
    cseq <- as.character(seq)
    cseq <- unlist(strsplit(cseq[1], ""))
    nseq <- length(cseq)

  # simple argument checks
    if (!is(n, "numeric")) stop("'n' must be an integer")
    symbol <- toupper(symbol)
    if (nchar(symbol) != 1 | !symbol %in% names(iupac))
      stop("'symbol' must be a single valid IUPAC symbol")
    if (n == 0) return(paste(cseq, collapse = ""))
    if (n > nseq) stop("too many substitutions for ", nseq, " bases")

  # which bases are to be used for the substitution?
    ACGT <- strsplit(iupac[symbol], "")[[1]]

  # create all possible combinations of positions to be changed in 'index'
    index <- lapply(seq_len(n), function(j) combn(nseq, j, simplify = FALSE))
    index <- unlist(index, recursive = FALSE)

  # for each numeric vector in index, create as many variants as
  # alternative bases are needed, collect in 'ans'
    ans <- lapply(index, function(idx) {
      bases <- lapply(cseq[idx], function(v) setdiff(ACGT, v))
      bases <- bases[sapply(bases, length) > 0] # defensive 
      bases <- expand.grid(bases, stringsAsFactors = FALSE)
      bases <- as.matrix(bases)
      nvars <- nrow(bases)

      vars <- do.call(rbind, rep(list(cseq), nvars))
      vars[ ,idx] <- bases
      if (!is.null(vars))
        return(split(vars, seq_len(nvars)))
    })
    ans <- unlist(ans, recursive = FALSE)
    ans <- sapply(ans, paste, collapse = "")
    ans <- unique(ans) # remove duplicates
    return(ans)
  }

Examples of the output:

  makeDel(seq1, 0)
> [1] "GGCGACTG"

  makeDel(seq1, 1)
> [1] "GCGACTG" "GGGACTG" "GGCACTG" "GGCGCTG" "GGCGATG" "GGCGACG" "GGCGACT"

  makeDel(seq1, 2)
>  [1] "GCGACTG" "GGGACTG" "GGCACTG" "GGCGCTG" "GGCGATG" "GGCGACG" "GGCGACT"
>  [8] "CGACTG"  "GGACTG"  "GCACTG"  "GCGCTG"  "GCGATG"  "GCGACG"  "GCGACT" 
> [15] "GGGCTG"  "GGGATG"  "GGGACG"  "GGGACT"  "GGCCTG"  "GGCATG"  "GGCACG" 
> [22] "GGCACT"  "GGCGTG"  "GGCGCG"  "GGCGCT"  "GGCGAG"  "GGCGAT"  "GGCGAC" 

  makeIns(seq1, 1) # default form
>  [1] "GAGCGACTG" "GCGCGACTG" "GGGCGACTG" "GTGCGACTG" "GGACGACTG" "GGCCGACTG"
>  [7] "GGTCGACTG" "GGCAGACTG" "GGCGGACTG" "GGCTGACTG" "GGCGAACTG" "GGCGCACTG"
> [13] "GGCGTACTG" "GGCGACCTG" "GGCGAGCTG" "GGCGATCTG" "GGCGACATG" "GGCGACGTG"
> [19] "GGCGACTTG" "GGCGACTAG" "GGCGACTCG" "GGCGACTGG"

  makeIns(seq1, 1, symbol = "Y") # inserting only "C" or "T"
>  [1] "GCGCGACTG" "GTGCGACTG" "GGCCGACTG" "GGTCGACTG" "GGCTGACTG" "GGCGCACTG"
>  [7] "GGCGTACTG" "GGCGACCTG" "GGCGATCTG" "GGCGACTTG" "GGCGACTCG"

  makeSub("AAA", 1)
> [1] "CAA" "GAA" "TAA" "ACA" "AGA" "ATA" "AAC" "AAG" "AAT"

  makeSub("AAA", 2)
>  [1] "CAA" "GAA" "TAA" "ACA" "AGA" "ATA" "AAC" "AAG" "AAT" "CCA" "GCA" "TCA"
> [13] "CGA" "GGA" "TGA" "CTA" "GTA" "TTA" "CAC" "GAC" "TAC" "CAG" "GAG" "TAG"
> [25] "CAT" "GAT" "TAT" "ACC" "AGC" "ATC" "ACG" "AGG" "ATG" "ACT" "AGT" "ATT"

These functions can be used together with vmatchPattern() to create variants and extract matches. One suggested approach would be to first find those sequences with mismatches using max.mismatch = 1. Next, find sequences with deletions and with insertions using vmatchPattern() with fixed = FALSE and the default value of 0 for max.mismatch.

Alternatively, the explicit patterns generated by the helper functions can be passed to grep processes running in parallel! What follows shows the use of vmatchPattern(), but there may be reasons to perform the analysis with different tools. See the comments on this topic.
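For the grep route, one option worth sketching is to collapse the generated variants into a single alternation regex, so each read is scanned once instead of once per variant. The reads and patterns below are made-up toy data, not output from the helper functions, and whether this is faster than separate fixed-string searches will depend on your data.

```r
# Toy reads and three hand-picked variants (original, one deletion, one insertion).
reads <- c("TTGGCGACTGAA", "TTGGCGCTGAA", "TTGGCTACTGAA", "ACGTACGTACGT")
patterns <- c("GGCGACTG", "GGCGCTG", "GGCTGACTG")

# One pass per read with a combined regular expression.
regex <- paste(patterns, collapse = "|")
hits <- grepl(regex, reads)
hits  # TRUE TRUE FALSE FALSE

# Per-variant detail: a reads-by-patterns logical matrix from fixed-string searches.
by_pattern <- vapply(patterns, grepl, logical(length(reads)),
                     x = reads, fixed = TRUE)
rowSums(by_pattern) > 0  # same reads flagged as above
```

The combined regex only tells you that some variant matched; the matrix keeps track of which one, at the cost of one search per variant.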

# first, allow mismatches to the original pattern
# the result is a "ByPos_MIndex" object of length 1000
  m1 <- vmatchPattern(seq1, ref, max.mismatch = 1) # as before...
  n1 <- elementNROWS(m1) # counts the number of elements (matches)
  which(n1 > 0) # which of the 1000 ref sequence had a match with 0 or 1 mismatches?
>  [1]  14  71  73  76  79  88  90 108 126 129 138 141 150 160 163 179 180 195 200
> [20] 205 212 225 227 239 241 246 247 255 276 277 280 299 310 335 338 345 347 357
> [39] 359 369 378 383 387 390 391 404 409 410 414 418 469 472 479 488 499 509 523
> [58] 531 533 567 571 574 580 588 590 591 594 601 634 636 646 654 667 679 685 694
> [77] 696 713 717 732 734 737 749 750 761 762 783 815 853 854 857 903 929 943 959
> [96] 969 981 986 998

# Second search each of the patterns with lapply
# generates seven lists of objects, each of length 1000
  pat2 <- makeDel(seq1, 1)
  m2 <- lapply(pat2, function(pat) vmatchPattern(pat, ref))

# generates 22 lists of objects, each of length 1000
  pat3 <- makeIns(seq1, 1)
  m3 <- lapply(pat3, function(pat) vmatchPattern(pat, ref))

The second and third results in m2 and m3 are lists of "ByPos_MIndex" objects. The example below extracts the number of matches from m2 and shows these matches in an abbreviated form with str(). Each value in a list identifies a reference sequence that had at least one match with the respective pattern.

  n2 <- lapply(m2, elementNROWS)
  str(sapply(n2, function(n) which(n > 0)))
> List of 7
>  $ : int [1:14] 14 138 179 335 369 391 567 679 713 734 ...
>  $ : int [1:18] 138 200 240 298 310 343 510 594 598 599 ...
>  $ : int [1:15] 21 26 45 60 260 497 541 600 607 642 ...
>  $ : int [1:17] 27 54 120 121 123 132 210 242 244 257 ...
>  $ : int [1:18] 15 33 110 126 154 419 528 539 546 606 ...
>  $ : int [1:12] 24 77 79 139 525 588 601 679 770 850 ...
>  $ : int [1:15] 179 345 378 414 469 571 574 580 591 713 ...

This final example examines the third list of 22 "ByPos_MIndex" objects (m3) by the same mechanism. It shows that some of the 22 variants fail to match, some match one reference sequence, and five match two reference sequences.

    n3 <- lapply(m3, elementNROWS) # extract all counts
    names(n3) <- sprintf("pat_%02d", seq_along(n3)) # for convenience
    str(lapply(n3, function(n) which(n > 0)))
> List of 22
>  $ pat_01: int 679
>  $ pat_02: int 391
>  $ pat_03: int(0) 
>  $ pat_04: int 737
>  $ pat_05: int(0) 
>  $ pat_06: int(0) 
>  $ pat_07: int 108
>  $ pat_08: int 276
>  $ pat_09: int 439
>  $ pat_10: int [1:2] 764 773
>  $ pat_11: int(0) 
>  $ pat_12: int [1:2] 22 820
>  $ pat_13: int 795
>  $ pat_14: int [1:2] 914 981
>  $ pat_15: int(0) 
>  $ pat_16: int 112
>  $ pat_17: int 884
>  $ pat_18: int(0) 
>  $ pat_19: int [1:2] 345 378
>  $ pat_20: int [1:2] 571 854
>  $ pat_21: int 574
>  $ pat_22: int(0) 

Needless to say, a lot of data wrangling remains in order to extract sequence information. This can be coded with the help pages for matchPattern and with some understanding of the pattern matching logic described in help("lowlevel-matching", package = "Biostrings").

Although the routines in Biostrings use very fast and very memory-efficient algorithms for handling large sequences, Joe seems to find raw searching faster under other circumstances. There's always more to learn!
