
Search for a character string in a DNA sequence

I'm trying to look for certain nucleotide patterns in a gene sequence. I've read it in with read.table, and I've also tried converting it to vectors and data frames.

How do I search for a pattern (such as AACG) or even just a single nucleotide character? I've tried grep and %in%, but those are returning null results. It's probably something relatively simple I'm overlooking.

This is how I'm getting the data into the program. It's a huge file: 20,347 letters, all ACTG.

data <- read.table("MTHFR.txt")

I've been trying to get it into a character vector this way:

data.cv <- as.character(data)

But that's creating a list of what appears to be row numbers, instead of the sequence of nucleotides.

The data is available online here. This is what the head of the data currently looks like:

head(data)
                                                                      V1
1 ATGACGATAAAGGCACGGCCTCCAACGAGACCTGTGGGCACGGCCATGTTGGGGGCGGGGCTTCCGGTCA
2 CCCGCGCCGGTGGTTTCCGCCCTGTAGGCCCGCCTCTCCAGCAACCTGACACCTGCGCCGCGCCCCTTCA
3 CTGCGTTCCCCGCCCCTGCAGCGGCCACAGTGGTGCGGCCGGCGGCCGAGCGTTCTGAGTCACCCGGGAC
4 TGGAGGGTGAGTGACGGCGAGGCCGGGGTCGCCGGGAGGGAGATCCTGGAGCCGGCAAACAACCTCCCGG
5 GGGCAAGGACGTGCTTGTGGGCGGGGAGCGCTGGAGGCCGGCCTGCCTCTCTTCTTGGGGGGGGCTGCCG
6 CCTCCCTTGCGCACCCTTCGCGGGATTAGTGTAACTCCCAATGGCTACCACTTCCAGCGACCGCCAACCC

For most sequence-related bioinformatics tasks you really need to familiarize yourself with some of the more common packages within the Bioconductor project. Many common tasks already have implemented solutions that are blisteringly fast.

Biostrings has classes such as DNAString and DNAStringSet, which are used to store and manipulate DNA strings efficiently, with corresponding classes for amino acid (AA) and RNA sequences. It also includes various functions for searching, reverse complementing, etc. It sounds like you've already got your data imported, but an alternative would be to use the readDNAStringSet() function.

library(Biostrings)

data <- 'ATGACGATAAAGGCACGGCCTCCAACGAGACCTGTGGGCACGGCCATGTTGGGGGCGGGGCTTCC'
dna <- DNAString(data)

matchPattern('GGG', dna)

  Views on a 65-letter DNAString subject
subject: ATGACGATAAAGGCACGGCCTCCAACGAGACCTGTGGGCACGGCCATGTTGGGGGCGGGGCTTCC
views:
    start end width
[1]    36  38     3 [GGG]
[2]    51  53     3 [GGG]
[3]    52  54     3 [GGG]
[4]    53  55     3 [GGG]
[5]    57  59     3 [GGG]
[6]    58  60     3 [GGG]

countPattern('GGG', dna)
[1] 6

countPattern('GGA', reverseComplement(dna)) # number of occurrences of 'TCC' in the forward strand
[1] 2

I propose a solution that uses Biopython to read the sequence and get its reverse complement, then a straightforward algorithm to get the positions of a simple known k-mer. (If you want something more sophisticated, Biopython also has functionality for searching motifs.)

Reading your sequence from a file (assuming you have it in FASTA format):

from Bio import SeqIO
seq_record = SeqIO.read("my_sequence.fa", format="fasta")
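
The file in the question looks like plain lines of letters rather than FASTA. As a rough sketch for that case, the lines can be joined into one string with the standard library alone. The helper name `read_plain_sequence` and the demo file contents are made up for illustration; they are not part of Biopython.

```python
import tempfile
from pathlib import Path

def read_plain_sequence(path) -> str:
    """Join all non-empty lines of a plain-text sequence file into one
    uppercase string, skipping FASTA-style header lines that start with '>'."""
    lines = Path(path).read_text().splitlines()
    return "".join(
        line.strip() for line in lines
        if line.strip() and not line.startswith(">")
    ).upper()

# Demonstration on a throwaway file holding two made-up lines.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo_sequence.txt"
    demo.write_text("ATGACGAT\naaaggcac\n")
    print(read_plain_sequence(demo))  # ATGACGATAAAGGCAC
```

The resulting string can be used in place of `str(seq_record.seq)` in the steps that follow.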

Make uppercase versions (just in case) of the forward strand and its reverse complement:

fwd = str(seq_record.seq.upper())
rev = str(seq_record.seq.reverse_complement().upper())
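
If Biopython is not installed, the same two strings can be produced without it. This is a minimal sketch, assuming the sequence is already a plain string containing only A, C, G and T; the function name `reverse_complement` and the short example sequence are illustrative only.

```python
# Hypothetical helper: reverse complement of a DNA string using only the
# standard library. Assumes the input contains only A, C, G, T (any case).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Uppercase the sequence, complement each base, then reverse it."""
    return seq.upper().translate(_COMPLEMENT)[::-1]

fwd = "ATGACGATAAAGG"  # illustrative sequence, not the real data
rev = reverse_complement(fwd)
print(rev)  # CCTTTATCGTCAT
```

Characters outside A, C, G and T (for example N) pass through `translate` unchanged, so ambiguity codes would need extra entries in the table.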

Find where the pattern occurs (positions will be in 0-based coordinates):

pattern = "ACTG"
k = len(pattern)

positions_in_fwd = [i for i in range(1 + len(fwd) - k) if fwd[i:i+k] == pattern]
positions_in_rev = [i for i in range(1 + len(rev) - k) if rev[i:i+k] == pattern]

(With the sequence and pattern you give, I find 24 locations for the pattern in the sequence and 20 in its reverse complement.)
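
The list comprehensions above compare a slice at every offset. A roughly equivalent sketch uses `str.find` to jump from hit to hit, and also converts a hit on the reverse complement back to forward-strand coordinates. The helper names `find_all` and `rev_to_fwd_start` are hypothetical, and the short sequence is made up for illustration.

```python
def find_all(seq: str, pattern: str) -> list[int]:
    """Return 0-based start positions of every (possibly overlapping) match."""
    positions = []
    i = seq.find(pattern)
    while i != -1:
        positions.append(i)
        i = seq.find(pattern, i + 1)  # advance by 1 so overlapping hits are kept
    return positions

def rev_to_fwd_start(pos_in_rev: int, seq_len: int, k: int) -> int:
    """Map a 0-based start on the reverse complement to the 0-based start
    of the same k-letter site on the forward strand."""
    return seq_len - pos_in_rev - k

fwd = "GGGGACTGTTCAGT"  # made-up example sequence
rev = fwd.translate(str.maketrans("ACGT", "TGCA"))[::-1]
pattern = "ACTG"
k = len(pattern)

print(find_all(fwd, pattern))  # [4]
print([rev_to_fwd_start(i, len(fwd), k) for i in find_all(rev, pattern)])  # [10]
print(find_all(fwd, "GGG"))  # [0, 1]
```

In this example the reverse-strand hit maps to forward position 10, where the forward strand reads CAGT, the reverse complement of ACTG.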

"Look for certain patterns" is a bit vague. Are you trying to extract the pattern? Are you trying to find the positions at which it occurs in the text? I'm going to assume both, but add anything you can to help specify the objective.

library(stringi)
library(magrittr)
# Data from site you provided was stored to "t.txt" on
# my machine so starting there
a <- readLines('t.txt')

Data info

 > summary(a)
    Length     Class      Mode 
      292   character   character 

Look at the head of the dataset

 > head(a,2)
    [1] "GTCAAGTTTTTTTGTTTATTTTTGAGACAGAGTCTGGCTCAATTGCCCAGGCTGAAGCAGAGGAGTGATC"
    [2] "TCAGCTCACTGCAACCTCTGCCTCCCGGGTTCAAGTGATTCTCCCGCCTCAGCTTCCTGAGTAGCTGGGA"
    > sum(nchar(a))
    [1] 20374

Now that we have the data, let's extract the 'AACG' pattern from each line

> aa <- stri_extract_all_regex(a, 'AACG',
                               omit_no_match = FALSE, simplify = TRUE) %>%
    unlist %>% as.character() %>% (function(x) x[!is.na(x)])

 > aa 
[1] "AACG" "AACG" "AACG" "AACG" "AACG" "AACG" "AACG" "AACG"
[9] "AACG" "AACG" "AACG" "AACG" "AACG" "AACG" "AACG" "AACG"
[17] "AACG"

Searching line by line misses any occurrence that spans a line break, so transform the dataset into one continuous string:

a_flat <- paste0(a, collapse = "")

And instead of extracting, we can find where in the text the pattern occurs and turn the result into a data frame

bb <- as.data.frame(stri_locate_all_regex(a_flat, "AACG")[[1]]) 

What this gives us is the locations in which the pattern occurs.

> bb
   start   end
1    807   810
2   1244  1247
3   1748  1751
4   1791  1794
5   2306  2309
6   3560  3563
7   4217  4220
8   4927  4930
9   6504  6507
10  8668  8671
11  9827  9830
12 10333 10336
13 11446 11449
14 12779 12782
15 13619 13622
16 16604 16607
17 16659 16662
18 19200 19203
19 20181 20184
20 20228 20231

And we can use those locations to pull the matching substrings out of the flattened string

 > sapply(1:nrow(bb), function(i){
    stri_sub(a_flat, bb[i,'start'], bb[i,'end'])
})
[1] "AACG" "AACG" "AACG" "AACG" "AACG" "AACG" "AACG" "AACG"
[9] "AACG" "AACG" "AACG" "AACG" "AACG" "AACG" "AACG" "AACG"
[17] "AACG" "AACG" "AACG" "AACG"

Hopefully this sheds a little light for you.

