简体   繁体   中英

Making a list of all mutations of a sequence (DNA)

I have a DNA sequence, and I want to find all instances of it, or any of its possible mutations in a list of DNA sequence reads. I am using grepl to do this, since it is faster than matchPattern in the instance I am using it. I use parLapply to feed my vector of mutations to the grepl function. But what I am interested in doing is making an easy way of auto-generating my vector of sequence mutations. Originally I typed each mutation, but that leaves room for human error, and if the sequence is lengthened, more mutations would need to be typed. In addition, my current code only allows 1 mutation, and some sequences should allow for more mutations than others. I am not looking for someone to write a loop for me, but just give me a suggestion for accounting for any string.

Right now, I have a semi-automated way of generating the mutations. It now generates the vector without me typing them all out, but only works for 8 nucleotide long sequences. There has to be a better way to generate the vector for any nucleotide sequence of any length.

This is my code:

#My sequence of interest
seq1 <- "GGCGACTG"
lenseq1 <- nchar(seq1)

#A vector of the length of the sequence I wish to create all mutations of
mutsinseq1 <- rep(seq1, 5*lenseq1+4*(lenseq1-1)+1)

#The possible substitutions, insertions, and deletions to the sequence of interest
possnuc <- c("A","T","C","G","")
lenpossnuc <- length(possnuc)

#changing all elements of the vector except for the first
#the first 8 if statements are nucleotide substitutions or deletions
#the other if statements allow for inserts between nucleotides
for(i in 2:length(mutsinseq1)){
  if(i<7){
    mutsinseq1[i] <- paste(possnuc[i-1],substr(seq1,2,lenseq1),sep = "") 
  } else if(i<12){
    mutsinseq1[i] <- paste(substr(seq1,1,1),possnuc[i-6],substr(seq1,3,lenseq1),sep = "")
  } else if(i<17){
    mutsinseq1[i] <- paste(substr(seq1,1,2),possnuc[i-11],substr(seq1,4,lenseq1),sep = "")
  } else if(i<22){
    mutsinseq1[i] <- paste(substr(seq1,1,3),possnuc[i-16],substr(seq1,5,lenseq1),sep = "")
  } else if(i<27){
    mutsinseq1[i] <- paste(substr(seq1,1,4),possnuc[i-21],substr(seq1,6,lenseq1),sep = "")
  } else if(i<32){
    mutsinseq1[i] <- paste(substr(seq1,1,5),possnuc[i-26],substr(seq1,7,lenseq1),sep = "")
  } else if(i<37){
    mutsinseq1[i] <- paste(substr(seq1,1,6),possnuc[i-31],substr(seq1,8,lenseq1),sep = "")
  } else if(i<42){
    mutsinseq1[i] <- paste(substr(seq1,1,7),possnuc[i-36],sep = "")
  } else if(i<46){
    mutsinseq1[i] <- paste(substr(seq1,1,1),possnuc[i-41],substr(seq1,2,lenseq1),sep = "")
  } else if(i<50){
    mutsinseq1[i] <- paste(substr(seq1,1,2),possnuc[i-45],substr(seq1,3,lenseq1),sep = "")
  } else if(i<54){
    mutsinseq1[i] <- paste(substr(seq1,1,3),possnuc[i-49],substr(seq1,4,lenseq1),sep = "")
  } else if(i<58){
    mutsinseq1[i] <- paste(substr(seq1,1,4),possnuc[i-53],substr(seq1,5,lenseq1),sep = "")
  } else if(i<62){
    mutsinseq1[i] <- paste(substr(seq1,1,5),possnuc[i-57],substr(seq1,6,lenseq1),sep = "")
  } else if(i<66){
    mutsinseq1[i] <- paste(substr(seq1,1,6),possnuc[i-61],substr(seq1,7,lenseq1),sep = "")
  } else{
    mutsinseq1[i] <- paste(substr(seq1,1,7),possnuc[i-65],substr(seq1,8,lenseq1),sep = "")
  }
}

#getting rid of duplicate mutations
mutsinseq1 <- mutsinseq1[-which(duplicated(mutsinseq1))]

The following is what I wish to produce (and is produced by my current code):

mutsinseq1
[1] "GGCGACTG"  "AGCGACTG"  "TGCGACTG"  "CGCGACTG"  "GCGACTG"   "GACGACTG"  "GTCGACTG"  "GCCGACTG"  "GGAGACTG"  "GGTGACTG"  "GGGGACTG"  "GGGACTG"   "GGCAACTG" 
[14] "GGCTACTG"  "GGCCACTG"  "GGCACTG"   "GGCGTCTG"  "GGCGCCTG"  "GGCGGCTG"  "GGCGCTG"   "GGCGAATG"  "GGCGATTG"  "GGCGAGTG"  "GGCGATG"   "GGCGACAG"  "GGCGACCG" 
[27] "GGCGACGG"  "GGCGACG"   "GGCGACTA"  "GGCGACTT"  "GGCGACTC"  "GGCGACT"   "GAGCGACTG" "GTGCGACTG" "GCGCGACTG" "GGGCGACTG" "GGACGACTG" "GGTCGACTG" "GGCCGACTG"
[40] "GGCAGACTG" "GGCTGACTG" "GGCGGACTG" "GGCGAACTG" "GGCGTACTG" "GGCGCACTG" "GGCGATCTG" "GGCGACCTG" "GGCGAGCTG" "GGCGACATG" "GGCGACTTG" "GGCGACGTG" "GGCGACTAG"
[53] "GGCGACTCG" "GGCGACTGG"

How do I solve the problem?

In other languages, you might do this with a series of nested loops, but in R, there's some nice combinatorics functions. Here's the overall function to do what you want:

library(stringr)
library(purrr)
library(dplyr)

mutate_sequence <- function(string, num = 1, nucleotides = c("A","T","C","G","_")) {
  l_str <- str_length(string)

  choices <- cross(list(
    cols = combn(seq_len(l_str), num, simplify = F),
    muts = cross(rerun(num, nucleotides)) %>% map(unlist)
  ))

  choice_matrix <- 
    map_dfr(choices, as_tibble, .id = "rows") %>% 
    mutate(rows = as.numeric(rows))

  seq_matrix <- str_split(rep(string, max(choice_matrix$rows)), "", simplify = T)

  seq_matrix[as.matrix(choice_matrix[,1:2])] <- str_to_lower(choice_matrix$muts)
  apply(seq_matrix, 1, paste, collapse = "")
}

I used some packages to make things a little easier on me, but it could all be translated into base R.

Here's example output:

mutate_sequence("ATCG", num = 2)
  [1] "aaCG" "aTaG" "aTCa" "AaaG" "AaCa" "ATaa" "taCG" "tTaG" "tTCa" "AtaG" "AtCa" "ATta" "caCG" "cTaG" [15] "cTCa" "AcaG" "AcCa" "ATca" "gaCG" "gTaG" "gTCa" "AgaG" "AgCa" "ATga" "_aCG" "_TaG" "_TCa" "A_aG" [29] "A_Ca" "AT_a" "atCG" "aTtG" "aTCt" "AatG" "AaCt" "ATat" "ttCG" "tTtG" "tTCt" "AttG" "AtCt" "ATtt" [43] "ctCG" "cTtG" "cTCt" "ActG" "AcCt" "ATct" "gtCG" "gTtG" "gTCt" "AgtG" "AgCt" "ATgt" "_tCG" "_TtG" [57] "_TCt" "A_tG" "A_Ct" "AT_t" "acCG" "aTcG" "aTCc" "AacG" "AaCc" "ATac" "tcCG" "tTcG" "tTCc" "AtcG" [71] "AtCc" "ATtc" "ccCG" "cTcG" "cTCc" "AccG" "AcCc" "ATcc" "gcCG" "gTcG" "gTCc" "AgcG" "AgCc" "ATgc" [85] "_cCG" "_TcG" "_TCc" "A_cG" "A_Cc" "AT_c" "agCG" "aTgG" "aTCg" "AagG" "AaCg" "ATag" "tgCG" "tTgG" [99] "tTCg" "AtgG" "AtCg" "ATtg" "cgCG" "cTgG" "cTCg" "AcgG" "AcCg" "ATcg" "ggCG" "gTgG" "gTCg" "AggG" [113] "AgCg" "ATgg" "_gCG" "_TgG" "_TCg" "A_gG" "A_Cg" "AT_g" "a_CG" "aT_G" "aTC_" "Aa_G" "AaC_" "ATa_" [127] "t_CG" "tT_G" "tTC_" "At_G" "AtC_" "ATt_" "c_CG" "cT_G" "cTC_" "Ac_G" "AcC_" "ATc_" "g_CG" "gT_G" [141] "gTC_" "Ag_G" "AgC_" "ATg_" "__CG" "_T_G" "_TC_" "A__G" "A_C_" "AT__" 

I made the mutations lowercase or "_" to make it obvious where they are, but you can easily change that to get them back to "normal" sequences.

So each line does some things:

l_str <- str_length(string)

Gets the number of characters in the string.

combn(seq_len(l_str), num, simplify = F)

1) This is all possible combinations of positions along the sequence (indexes), taken num at a time, for the number of mutations.

rerun(num, nucleotides)

2) This repeats your vector of nucleotides num times, and makes it a list. cross(rerun(num, nucleotides)) then gives you every combination from that list as a list, so you're taking every possible combination of nucleotides, with repeats. cross(rerun(num, nucleotides)) %>% map(unlist) collapses the deepest level of the list into vectors.

So those last two chunks give you every possible choice of positions, and then every possible combination of replacements. Then we need every possible combination of those as pairs!

  choices <- cross(list(
    cols = combn(seq_len(l_str), num, simplify = F),
    muts = cross(rerun(num, nucleotides)) %>% map(unlist)
  ))

For the above output, that means:

 [[1]] [[1]]$`cols` [1] 1 2 [[1]]$muts [1] "A" "A" [[2]] [[2]]$`cols` [1] 1 2 [[2]]$muts [1] "T" "A" ... 

So first for positions 1/2 , it gives us A/A , T/A , G/A , C/A , _/A , etc. Then each combination again for positions 1/3 , then positions 1/4 , then 2/3 , then 2/4 , then 3/4 .

So now you have this long list, and let's make it into something nicer. First we make each element into a dataframe with cols and muts , then bind them all into a single one with an identifier for each element called rows :

map_dfr(choices, as_tibble, .id = "rows")
 # A tibble: 50 x 3 rows cols muts <chr> <int> <chr> 1 1 1 A 2 1 2 A 3 2 1 T 4 2 2 A 5 3 1 C 6 3 2 A 7 4 1 G 8 4 2 A 9 5 1 _ 10 5 2 A # ... with 40 more rows 

This gives us a long dataframe. Each of rows is one output string, and the cols tells us which position in the string will be replaces. muts is the characters that will go in those positions. In order to do the subsetting later, we'll then convert rows to numeric, using mutate(...) .

seq_matrix <- str_split(rep(string, max(choice_matrix$rows)), "", simplify = T)

Now we take your original string and repeat it as many times as the choice_matrix tells us we'll have mutated sequences. Then we take that vector, and split every one along the character boundaries:

  [,1] [,2] [,3] [,4] [1,] "A" "T" "C" "G" [2,] "A" "T" "C" "G" [3,] "A" "T" "C" "G" [4,] "A" "T" "C" "G" [5,] "A" "T" "C" "G" [6,] "A" "T" "C" "G" ... 

Now we have a big matrix, and R is fast at operating on these big matrices. We could have done all the other steps with matrix operations, but that seemed like more work than using this list-combination functions.

seq_matrix[as.matrix(choice_matrix[,1:2])] <- str_to_lower(choice_matrix$muts)

This identifies each position based on the rows and cols in the choice_matrix . Then it puts the appropriate value from muts in it. This is also where you can take out str_to_lower to keep them from being lowercase. You'd change the default argument of nucleotides to make the "_" into "" .

  [,1] [,2] [,3] [,4] [1,] "a" "a" "C" "G" [2,] "a" "T" "a" "G" [3,] "a" "T" "C" "a" [4,] "A" "a" "a" "G" [5,] "A" "a" "C" "a" [6,] "A" "T" "a" "a" ... 

So row 1 got "A" and "A" in positions 1 and 2. Then row 2 got "A" and "A" in positions 1 and 3, etc. Now we just have to apply across each row (that's what the 1 in apply(..., 1, ...) does) a function to combine each row into a single string. That would be paste(..., collapse = "") .

This will get you huge output quickly. If you do 3 mutations on your original 8 nucleotide sequence, you get an output of 7000. 4 mutations is 43750. And each time gets that much slower, taking about 5s to run the 4 mutations on my desktop. You could precalculate the output length, which is choose(l_str, num) * length(nucleotides)^num .


Updated, again:

To handle insertions as well as deletions, we just need the character matrix to have a slot for every possible insertion. Here's that version:

mutate_sequence <- function(string, num = 1, nucleotides = c("A","T","C","G","")) {
  if (num < 1) {return(string)}

  l_str <- str_length(string)
  l_pos <- (num + 1)*(l_str - 1) + 1

  choices <- cross(list(
    cols = combn(seq_len(l_pos), num, simplify = F),
    muts = cross(rerun(num, nucleotides)) %>% map(unlist)
  ))

  choice_matrix <- 
    map_dfr(choices, as_data_frame, .id = "rows") %>% 
    mutate(rows = as.numeric(rows))

  blanks <- character(l_pos)
  orig_pos <- (seq_len(l_str) - 1) * (num+1) + 1
  blanks[orig_pos] <- str_split(string, "", simplify = T)

  seq_matrix <- matrix(
    rep(blanks, max(choice_matrix$rows)), 
    ncol = l_pos, byrow = T
    )

  seq_matrix[as.matrix(choice_matrix[,1:2])] <- str_to_lower(choice_matrix$muts)
  sequences <- apply(seq_matrix, 1, paste, collapse = "")
  sequences[!duplicated(str_to_upper(sequences))]
}

This does essentially the same as the version of the function above, but first you create a blank vector with enough spots for every insertion. For each original nucleotide, you need an additional spot to insert after it, except the last one. That works out to l_pos <- (num + 1)*(l_str - 1) + 1 positions. character(l_pos) gives you the blanks, and then you fill in the blanks with the original nucleotides at (seq_len(l_str) - 1) * (num+1) + 1 .

For example, ATCG with two mutations becomes "A" "" "" "T" "" "" "C" "" "" "G" . The rest of the function works the same, just putting every possible nucleotide (or deletion) in every possible spot.

The output before paste ing it all back together then looks like:

     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] "a"  "a"  ""   "T"  ""   ""   "C"  ""   ""   "G"  
[2,] "a"  ""   "a"  "T"  ""   ""   "C"  ""   ""   "G"  
[3,] "a"  ""   ""   "a"  ""   ""   "C"  ""   ""   "G"  
[4,] "a"  ""   ""   "T"  "a"  ""   "C"  ""   ""   "G"  
[5,] "a"  ""   ""   "T"  ""   "a"  "C"  ""   ""   "G" 
...  

Then after paste ing each row, we can check for repeats with duplicated and exclude those. We could also just get rid of the lowercase mutations and use unique(sequences) instead. Now the output is much shorter than before, the first 55 of 278:

  [1] "aaTCG" "aaCG" "aTaCG" "aTaG" "aTCaG" "aTCa" "AaaTCG" "AaaCG" "AaTaCG" "AaTaG" "AaTCaG" [12] "AaTCa" "AaaG" "AaCaG" "AaCa" "ATaaCG" "ATaaG" "ATaCaG" "ATaCa" "ATaa" "ATCaaG" "ATCaa" [23] "taTCG" "taCG" "tTaCG" "tTaG" "tTCaG" "tTCa" "AtaTCG" "AtTaCG" "AtTaG" "AtTCaG" "AtTCa" [34] "ATta" "ATCtaG" "ATCta" "caTCG" "caCG" "cTaCG" "cTaG" "cTCaG" "cTCa" "AcaTCG" "AcaCG" [45] "AcTaCG" "AcTaG" "AcTCaG" "AcTCa" "AcaG" "AcCaG" "AcCa" "ATcaCG" "ATcCaG" "ATcCa" "gaTCG" ... 

EDITED Entirely revised for a third time to better address the question! Incidentally, the key solution here (in the form of three helper functions) does not require the Biostrings package.

As I understand the problem, a short DNA query sequence is to be matched against a large number of reference DNA sequences. The twist here is that an arbitrary number of variations in the form of insertions or deletions on the DNA query sequence are to be searched for in the reference DNA sequences.

The function vmatchPattern() from the Biostrings package can identify matches of a given pattern with an arbitrary number of mismatches in a set of reference sequences. In addition, vmatchPattern() can identify matches of a given pattern with possible insertions or deletions (indel). However, unlike matchPattern() , vmatchPattern() cannot do both at the same time.

The solution sought here is to generate generate variations of the DNA query sequence that can then be passed to a search function such as grepl() or as suggested here, vmatchPattern() .

The revised solution posted here includes three functions. makeDel() will generate all possible variants of a short sequence with an arbitrary number of deletions. The companion function, makeIns() will generate variants of the short sequence with the insertion specified as the IUPAC symbol in symbol . makeSub() will make the desired substitutions using the bases specified by the IUPAC code in symbol . This approach, generating all possible combinations of other bases, allows the character strings to be directly used in pattern-matching functions including vmatchPaterrn .

If it is going to be used, this ensures that the package Biostrings is available. This code applies to versions of R at 3.60 and beyond.

  if (!require("Biostrings")) {
    if (!requireNamespace("BiocManager", quietly = TRUE))
        install.packages("BiocManager")
    BiocManager::install("Biostrings")
  }
  library(Biostrings)

Now some test data. The original query sequence "GGCGACTG" will be used as the "query" and 1000 random sequences between 200 and 400 nucleotides will be used as the reference set.

  seq1 <- "GGCGACTG"

  set.seed(1234)
  ref <- replicate(1e3,
    sample(c("A", "C", "G", "T"), sample(200:400, 1), replace = TRUE),
    simplify = FALSE)
  ref <- sapply(ref, paste, collapse = "")
  ref <- DNAStringSet(ref)

Before proceeding with the solution, let's peek at what can be found with the pattern alone.

# how times does the pattern occur? 
  table(vcountPattern(seq1, ref)) # just 3 times
>   0   1 
> 997   3 

# how many times allowing for one mismatch?
# once in 96 sequences and twice in three sequences
  n <- vcountPattern(seq1, ref, max.mismatch = 1)
  table(n)
> n
>   0   1   2 
> 901  96   3 

# examine the matched sequences
  m <- vmatchPattern(seq1, ref, max.mismatch = 1) # find the patterns
  sel <- which.max(n) # select the first one with 2 matches
  Views(ref[[sel]], m[[sel]]) # examine the matches
>   Views on a 384-letter DNAString subject
> subject: TCGCGTCGCACTTCTGCTAACACAGC...GCCCAGTCGACTGCTGCTCGGATTGC
> views:
>     start end width
> [1]   104 111     8 [GGCGACCG]
> [2]   364 371     8 [GTCGACTG] 

Here are the three helper functions to generate the variants. The argument seq can be a character string such as "GGGCGACTG" or a DNAString object. The argument n is an integer that specifies the upper limit on deletions, insertions, or substitutions. These functions will create variants with 0, 1, ..., n deletions, insertions or substitutions. If n is set to 0, the function will return the original sequence. The argument symbol for makeIns() and makeSub() should be a single IUPAC character to specify which bases will be inserted or substituted. The default value of "N" specifies all possible bases ("A", "C", "G" and "T").

makeDel() use combn() to identify the deletion positions. The logic for makeIns() and makeSub() is a bit more complex because of the need to allow insertions to be adjacent to each other and the need to create all combinations. Here I chose not to add insertions at the beginning or end of the query sequence.

All functions return a character vector suitable for use in vmatchPattern() or grep() .

To create deletions in a DNA string:

  ##
  ## makeDel - create 1:n deletions in a character string (DNA sequence)
  ##  return a character vector of all possible variants
  ##
  makeDel <- function(seq, n) {
  # accept only a single value for 'seq'
    cseq <- as.character(seq)
    cseq <- unlist(strsplit(cseq[1], ""))
    nseq <- length(cseq)

  # simple argument checks
    if (!is(n, "numeric")) stop("'n' must be an integer")
    if (n == 0) return(paste(cseq, collapse = ""))
    if (n >= nseq) stop("too many deletions for ", nseq, " letters")

  # create all possible combinations to be dropped in 'index'
    index <- lapply(seq_len(n), function(j) combn(nseq, j, simplify = FALSE))
    index <- unlist(index, recursive = FALSE)

  # drop base in each possible position and reassemble
    ans <- lapply(index, function(idx) cseq[-idx])
    ans <- sapply(ans, paste, collapse = "")
    ans <- unique(ans) # remove duplicates
    return(ans)
  }

To create insertions in a DNA string:

  ##
  ## makeIns - create 1:n insertions into DNA string (character vector)
  ##   where each insertion is one of a given IUPAC-specified symbol
  ##   return a character vector of all possible variants
  ##
  makeIns <- function(seq, n, symbol = "N") {
  # IUPAC codes for ambiguous bases
    iupac <- c(N = "ACGT", A = "A", C = "C", G = "G", T = "T", M = "AC", R = "AG",
      W = "AT", S = "CG", Y = "CT", K = "GT", V = "ACG", H = "ACT",
      D = "AGT", B = "CGT")

 # only accept single value for 'seq'
    cseq <- as.character(seq)
    cseq <- unlist(strsplit(cseq[1], ""))
    nseq <- length(cseq)

 # simple argument checks
    if (!is(n, "numeric")) stop("'n' must be an integer")
    symbol <- toupper(symbol)
    if (nchar(symbol) != 1 | !symbol %in% names(iupac))
      stop("'symbol' must be a single valid IUPAC symbol")
    if (n == 0) return(paste(cseq, collapse = ""))
    if (n >= nseq) stop("seems like too many insertions for ", nseq, " letters")

  # which bases are to be inserted?
    ACGT <- strsplit(iupac[symbol], "")[[1]]

  # create all possible combinations of positions for the insertion 
    ipos <- seq_len(nseq - 1) # insert after this position
    index <- lapply(1:n, function(i) do.call(expand.grid, rep(list(ipos), i)))
    index <- lapply(index, function(v) split(v, seq_len(nrow(v))))
    index <- unlist(index, recursive = FALSE)
    index <- lapply(index, unlist)
    index <- lapply(index, sort)

  # place the required number of insertions after each position in index
    res <- lapply(index, function(idx) {
      tally <- rle(idx)$lengths
      breaks <- diff(c(0, idx, nseq))
      NN <- Map(base::rep, symbol, tally)
      spl <- split(cseq, rep(seq_along(breaks), breaks))
      sel <- head(seq_along(spl), -1)
      spl[sel] <- Map(base::c, spl[sel], NN)
      ans <- unlist(spl)
      if (length(ACGT) > 1) { # replicate and replace with appropriate bases
        sites <- grep(symbol, ans)
        nsites <- length(sites)
        nsymbol <- length(ACGT)

        bases <- expand.grid(rep(list(ACGT), nsites), stringsAsFactors = FALSE)
        bases <- as.matrix(bases)
        nvars <- nrow(bases)

        ans <- do.call(rbind, rep(list(ans), nvars))
        ans[, sites] <- bases
        ans <- split(ans, seq_len(nvars))
        ans <- lapply(ans, paste, collapse = "")
      }
      else
        ans <- paste(ans, collapse = "")
      return(ans)
    })
    res <- unlist(res)
    res <- unique(res)
    return(res)
  }

To create substitutions in DNA string:

  ##
  ## makeSub - create an arbitrary number of substitutions in each 1:n positions
  ##   with the IUPAC bases specified by 'symbol'
  ##   return a character vector with all possible variants
  ##
  makeSub <- function(seq, n, symbol = "N")
  {
  # IUPAC codes for ambiguous bases
    iupac <- c(N = "ACGT", A = "A", C = "C", G = "G", T = "T", M = "AC", R = "AG",
      W = "AT", S = "CG", Y = "CT", K = "GT", V = "ACG", H = "ACT",
      D = "AGT", B = "CGT")

  # accept only a single value for 'seq'
    cseq <- as.character(seq)
    cseq <- unlist(strsplit(cseq[1], ""))
    nseq <- length(cseq)

  # simple argument checks
    if (!is(n, "numeric")) stop("'n' must be an integer")
    symbol <- toupper(symbol)
    if (nchar(symbol) != 1 | !symbol %in% names(iupac))
      stop("'symbol' must be a single valid IUPAC symbol")
    if (n == 0) return(paste(cseq, collapse = ""))
    if (n > nseq) stop("too many substitutions for ", nseq, " bases")

  # which bases are to be used for the substitution?
    ACGT <- strsplit(iupac[symbol], "")[[1]]

  # create all possible combinations of positions to be changed in 'index'
    index <- lapply(seq_len(n), function(j) combn(nseq, j, simplify = FALSE))
    index <- unlist(index, recursive = FALSE)

  # for each numeric vector in index, create as many variants as
  # alternative bases are needed, collect in 'ans'
    ans <- lapply(index, function(idx) {
      bases <- lapply(cseq[idx], function(v) setdiff(ACGT, v))
      bases <- bases[sapply(bases, length) > 0] # defensive 
      bases <- expand.grid(bases, stringsAsFactors = FALSE)
      bases <- as.matrix(bases)
      nvars <- nrow(bases)

      vars <- do.call(rbind, rep(list(cseq), nvars))
      vars[ ,idx] <- bases
      if (!is.null(vars))
        return(split(vars, seq_len(nvars)))
    })
    ans <- unlist(ans, recursive = FALSE)
    ans <- sapply(ans, paste, collapse = "")
    ans <- unique(ans) # remove duplicates
    return(ans)
  }

Examples of the output:

  makeDel(seq1, 0)
> [1] "GGCGACTG"

  makeDel(seq1, 1)
> [1] "GCGACTG" "GGGACTG" "GGCACTG" "GGCGCTG" "GGCGATG" "GGCGACG" "GGCGACT"

  makeDel(seq1, 2)
>  [1] "GCGACTG" "GGGACTG" "GGCACTG" "GGCGCTG" "GGCGATG" "GGCGACG" "GGCGACT"
>  [8] "CGACTG"  "GGACTG"  "GCACTG"  "GCGCTG"  "GCGATG"  "GCGACG"  "GCGACT" 
> [15] "GGGCTG"  "GGGATG"  "GGGACG"  "GGGACT"  "GGCCTG"  "GGCATG"  "GGCACG" 
> [22] "GGCACT"  "GGCGTG"  "GGCGCG"  "GGCGCT"  "GGCGAG"  "GGCGAT"  "GGCGAC" 

  makeIns(seq1, 1) # default form
>  [1] "GAGCGACTG" "GCGCGACTG" "GGGCGACTG" "GTGCGACTG" "GGACGACTG" "GGCCGACTG"
>  [7] "GGTCGACTG" "GGCAGACTG" "GGCGGACTG" "GGCTGACTG" "GGCGAACTG" "GGCGCACTG"
> [13] "GGCGTACTG" "GGCGACCTG" "GGCGAGCTG" "GGCGATCTG" "GGCGACATG" "GGCGACGTG"
> [19] "GGCGACTTG" "GGCGACTAG" "GGCGACTCG" "GGCGACTGG"

  makeIns(seq1, 1, symbol = "Y") # inserting only "C" or "T"
>  [1] "GCGCGACTG" "GTGCGACTG" "GGCCGACTG" "GGTCGACTG" "GGCTGACTG" "GGCGCACTG"
>  [7] "GGCGTACTG" "GGCGACCTG" "GGCGATCTG" "GGCGACTTG" "GGCGACTCG"

  makeSub("AAA", 1)
> [1] "CAA" "GAA" "TAA" "ACA" "AGA" "ATA" "AAC" "AAG" "AAT"

  makeSub("AAA", 2)
>  [1] "CAA" "GAA" "TAA" "ACA" "AGA" "ATA" "AAC" "AAG" "AAT" "CCA" "GCA" "TCA"
> [13] "CGA" "GGA" "TGA" "CTA" "GTA" "TTA" "CAC" "GAC" "TAC" "CAG" "GAG" "TAG"
> [25] "CAT" "GAT" "TAT" "ACC" "AGC" "ATC" "ACG" "AGG" "ATG" "ACT" "AGT" "ATT"

These functions can be used together with vmatchPattern() to create variants and extract matches. One suggested approach would be to first find those sequences with mismatches using max.mismatch = 1 . Next , find sequences with deletions and with insertions using vmatchPattern() with fixed = FALSE and the default value of 0 for max.mismatch .

Alternatively , the explicit patterns generated by the helper functions can be passed to grep processes running in parallel! What follows shows the use of vmatchPattern but there may be reasons to perform the analysis with different tools. See the comments on this topic.

# first, allow mismatches to the original pattern
# the result is a "ByPos_MIndex" object of length 1000
  m1 <- vmatchPattern(seq1, ref, max.mismatch = 1) # as before...
  n1 <- elementNROWS(m1) # counts the number of elements (matches)
  which(n1 > 0) # which of the 1000 ref sequence had a match with 0 or 1 mismatches?
>  [1]  14  71  73  76  79  88  90 108 126 129 138 141 150 160 163 179 180 195 200
> [20] 205 212 225 227 239 241 246 247 255 276 277 280 299 310 335 338 345 347 357
> [39] 359 369 378 383 387 390 391 404 409 410 414 418 469 472 479 488 499 509 523
> [58] 531 533 567 571 574 580 588 590 591 594 601 634 636 646 654 667 679 685 694
> [77] 696 713 717 732 734 737 749 750 761 762 783 815 853 854 857 903 929 943 959
> [96] 969 981 986 998

# Second search each of the patterns with lapply
# generates seven lists of objects, each of length 10000
  pat2 <- makeDel(seq1, 1)
  m2 <- lapply(pat2, function(pat) vmatchPattern(pat, ref))

# generates 22 lists of objects, each of length 10000
  pat3 <- makeIns(seq1, 1)
  m3 <- lapply(pat3, function(pat) vmatchPattern(pat, ref))

The second and third results in m2 and m3 are lists of "ByPos_MIndex" objects. The example below extracts the number of matches from m2 and shows these matches in an abbreviated form with str() . Each value in a list identifies the reference sequence that had at least one match with respective pattern.

  n2 <- lapply(m2, elementNROWS)
  str(sapply(n2, function(n) which(n > 0)))
> List of 7
>  $ : int [1:14] 14 138 179 335 369 391 567 679 713 734 ...
>  $ : int [1:18] 138 200 240 298 310 343 510 594 598 599 ...
>  $ : int [1:15] 21 26 45 60 260 497 541 600 607 642 ...
>  $ : int [1:17] 27 54 120 121 123 132 210 242 244 257 ...
>  $ : int [1:18] 15 33 110 126 154 419 528 539 546 606 ...
>  $ : int [1:12] 24 77 79 139 525 588 601 679 770 850 ...
>  $ : int [1:15] 179 345 378 414 469 571 574 580 591 713 ...

This final example examines the third list of 22 "ByPos_MIndex" objects ( m3 ) by the same mechanism. It shows that some of the 22 variants fail to match, some match once and five match twice.

    n3 <- lapply(m3, elementNROWS) # extract all counts
    names(n3) <- sprintf("pat_%02d", seq_along(n3)) # for convenience
    str(lapply(n3, function(n) which(n > 0)))
> List of 22
>  $ pat_01: int 679
>  $ pat_02: int 391
>  $ pat_03: int(0) 
>  $ pat_04: int 737
>  $ pat_05: int(0) 
>  $ pat_06: int(0) 
>  $ pat_07: int 108
>  $ pat_08: int 276
>  $ pat_09: int 439
>  $ pat_10: int [1:2] 764 773
>  $ pat_11: int(0) 
>  $ pat_12: int [1:2] 22 820
>  $ pat_13: int 795
>  $ pat_14: int [1:2] 914 981
>  $ pat_15: int(0) 
>  $ pat_16: int 112
>  $ pat_17: int 884
>  $ pat_18: int(0) 
>  $ pat_19: int [1:2] 345 378
>  $ pat_20: int [1:2] 571 854
>  $ pat_21: int 574
>  $ pat_22: int(0) 

Needless to say, a lot of data wrangling remains in order to extract sequence information. This can be coded with the help pages for matchPattern and with some understanding of the pattern matching logic described in help("lowlevel-matching", package = "Biostrings") .

Although the routines in Biostrings use very fast and very memory-efficient algorithms for handling large sequences. Joe seems to find raw searching faster under other circumstances. There's always more to learn!

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM