Searching for a string in a DNA sequence

Question

I'm trying to look at certain patterns of nucleotides in a gene sequence. I've just used read.table to get it in, but I've tried converting it to vectors and data frames as well.

How do I search for a pattern (such as AACG) or even just a single nucleotide character? I've tried grep and %in%, but those return empty results. It's probably something relatively simple that I'm overlooking.

This is how I'm getting the data into the program. It's a huge file: 20,347 letters, all ACTG.

data <- read.table(MTHFR.txt)

I've been trying to get it into a character vector this way:

data.cv <- as.character(data)

But that creates a list of what appear to be row numbers instead of the sequence of nucleotides.

The data is available online here. As of right now, this is what the head of the data looks like:

head(data)
                                                                      V1
1 ATGACGATAAAGGCACGGCCTCCAACGAGACCTGTGGGCACGGCCATGTTGGGGGCGGGGCTTCCGGTCA
2 CCCGCGCCGGTGGTTTCCGCCCTGTAGGCCCGCCTCTCCAGCAACCTGACACCTGCGCCGCGCCCCTTCA
3 CTGCGTTCCCCGCCCCTGCAGCGGCCACAGTGGTGCGGCCGGCGGCCGAGCGTTCTGAGTCACCCGGGAC
4 TGGAGGGTGAGTGACGGCGAGGCCGGGGTCGCCGGGAGGGAGATCCTGGAGCCGGCAAACAACCTCCCGG
5 GGGCAAGGACGTGCTTGTGGGCGGGGAGCGCTGGAGGCCGGCCTGCCTCTCTTCTTGGGGGGGGCTGCCG
6 CCTCCCTTGCGCACCCTTCGCGGGATTAGTGTAACTCCCAATGGCTACCACTTCCAGCGACCGCCAACCC

Answer 1

For most sequence-related bioinformatics tasks you really need to familiarize yourself with some of the more common packages within the Bioconductor project. Many of the common tasks have implemented solutions that are blisteringly fast.

Biostrings has classes such as DNAString and DNAStringSet, which store and manipulate DNA strings efficiently, with corresponding classes for amino acid (AA) and RNA sequences. It includes various functions for searching, reverse complementing, etc. It sounds like you've already got your data imported, but an alternative would be to use the readDNAStringSet() function.

library(Biostrings)

data <- 'ATGACGATAAAGGCACGGCCTCCAACGAGACCTGTGGGCACGGCCATGTTGGGGGCGGGGCTTCC'
dna <- DNAString(data)

matchPattern('GGG', dna)

  Views on a 65-letter DNAString subject
subject: ATGACGATAAAGGCACGGCCTCCAACGAGACCTGTGGGCACGGCCATGTTGGGGGCGGGGCTTCC
views:
    start end width
[1]    36  38     3 [GGG]
[2]    51  53     3 [GGG]
[3]    52  54     3 [GGG]
[4]    53  55     3 [GGG]
[5]    57  59     3 [GGG]
[6]    58  60     3 [GGG]

countPattern('GGG', dna)
[1] 6

countPattern('GGA', reverseComplement(dna)) # number of occurrences of 'TCC' in the forward strand
[1] 2

Answer 2

I propose a solution using Biopython to read the sequence and get its reverse complement, then a straightforward algorithm to get the positions of a simple known k-mer. (If you want more sophisticated things, Biopython has functionality for searching motifs.)

Reading your sequence from a file (assuming you have it in FASTA format):

from Bio import SeqIO
seq_record = SeqIO.read("my_sequence.fa", format="fasta")
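
If the file is not FASTA but plain lines of sequence (as the head(data) output in the question suggests), a standard-library reader may be enough. This is only a minimal sketch: the helper name read_plain_sequence is made up for illustration, and it assumes the file contains nothing but sequence letters and line breaks.

```python
from pathlib import Path


def read_plain_sequence(path):
    """Return the whole file as one uppercase string, ignoring line breaks.

    Hypothetical helper (not part of Biopython): assumes the file holds
    only sequence letters split across lines, with no FASTA header.
    """
    # split() with no argument breaks on any whitespace, so newlines,
    # spaces and trailing blanks all disappear when the pieces are joined.
    pieces = Path(path).read_text().split()
    return "".join(pieces).upper()


# Example call, assuming a file named "MTHFR.txt" exists:
# fwd = read_plain_sequence("MTHFR.txt")
```

The result is a plain str, so it can be searched the same way as fwd below; wrap it in Bio.Seq.Seq if you need the Biopython methods.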

Make an uppercase (just in case) version of the forward sequence and of its reverse complement:

fwd = str(seq_record.seq.upper())
rev = str(seq_record.seq.reverse_complement().upper())
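
For a quick sanity check without Biopython, the reverse complement of a plain string can be sketched with str.translate. The helper name reverse_complement and the restriction to the letters A, C, G, T and N are assumptions of this sketch, not something the answer specifies.

```python
# Translation table mapping each base to its complement; N stays N.
# Assumption: only A, C, G, T and N occur in the input.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse-complement a DNA string (hypothetical helper)."""
    # Complement every base, then reverse the whole string.
    return seq.upper().translate(COMPLEMENT)[::-1]


print(reverse_complement("GGA"))   # TCC
print(reverse_complement("AACG"))  # CGTT
```

Searching the reverse complement for a pattern is equivalent to searching the forward strand for the pattern's reverse complement, which is why GGA on one strand corresponds to TCC on the other.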

Find where the pattern occurs (positions are in 0-based coordinates):

pattern = "ACTG"
k = len(pattern)

positions_in_fwd = [i for i in range(1 + len(fwd) - k) if fwd[i:i+k] == pattern]
positions_in_rev = [i for i in range(1 + len(rev) - k) if rev[i:i+k] == pattern]

(With the sequence and pattern you give, I find 24 locations for the pattern in the sequence and 20 in its reverse complement.)
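
The slicing loop above can also be written with the standard re module. A plain re.finditer would skip overlapping hits, so this sketch wraps the pattern in a lookahead; the function name find_all is made up for illustration, and the test string is the 65-letter sample used in Answer 1.

```python
import re


def find_all(pattern, seq):
    """Return 0-based start positions of every (possibly overlapping) match."""
    # The lookahead (?=...) matches without consuming characters, so the
    # scan resumes one base later and overlapping hits are not skipped.
    return [m.start() for m in re.finditer(f"(?={re.escape(pattern)})", seq)]


seq = "ATGACGATAAAGGCACGGCCTCCAACGAGACCTGTGGGCACGGCCATGTTGGGGGCGGGGCTTCC"
print(find_all("GGG", seq))  # [35, 50, 51, 52, 56, 57]
```

These are the same six hits that matchPattern reports in Answer 1, shifted by one because R positions are 1-based.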

Answer 3

"Look for certain patterns" is a bit vague. Are you trying to extract the pattern? Are you trying to find the intervals at which it occurs in the text? I'm going to assume both, but add anything you can to help specify the objective.

library(stringi)
library(magrittr)
# Data from site you provided was stored to "t.txt" on
# my machine so starting there
a <- readLines('t.txt')

Data info

 > summary(a)
    Length     Class      Mode 
      292   character   character

Look at the head of the dataset:

 > head(a,2)
    [1] "GTCAAGTTTTTTTGTTTATTTTTGAGACAGAGTCTGGCTCAATTGCCCAGGCTGAAGCAGAGGAGTGATC"
    [2] "TCAGCTCACTGCAACCTCTGCCTCCCGGGTTCAAGTGATTCTCCCGCCTCAGCTTCCTGAGTAGCTGGGA"
    > sum(nchar(a))
    [1] 20374

Now that we have the data, let's extract the 'AACG' pattern:

> aa <- stri_extract_all_regex(a, 'AACG',
                               omit_no_match = FALSE, simplify = TRUE) %>%
    unlist %>% as.character() %>% (function(x) x[!is.na(x)])

 > aa 
[1] "AACG" "AACG" "AACG" "AACG" "AACG" "AACG" "AACG" "AACG"
[9] "AACG" "AACG" "AACG" "AACG" "AACG" "AACG" "AACG" "AACG"
[17] "AACG"

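Extracting line by line misses occurrences that straddle a line break (17 matches above, versus 20 once the lines are joined), so transform the dataset into one continuous string: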

a_flat <- paste0(a, collapse = "")
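
A pattern that straddles a line break is invisible to a line-by-line search. The effect does not depend on the language; here is a toy illustration sketched in Python, where the three lines are made-up data and count_overlapping is a hypothetical helper.

```python
def count_overlapping(pattern, text):
    """Count every occurrence of pattern in text, overlaps included."""
    k = len(pattern)
    return sum(text[i:i + k] == pattern for i in range(len(text) - k + 1))


# Made-up data: one AACG sits inside the third line, another one is
# split across the end of the first line and the start of the second.
lines = ["TTGAA", "CGTTA", "AACGT"]
pattern = "AACG"

per_line = sum(count_overlapping(pattern, line) for line in lines)
flat = count_overlapping(pattern, "".join(lines))
print(per_line, flat)  # 1 2
```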

Instead of extracting, we can find where in the text the pattern occurs and turn the result into a data frame:

bb <- as.data.frame(stri_locate_all_regex(a_flat, "AACG")[[1]])

This gives us the locations at which the pattern occurs:

> bb
   start   end
1    807   810
2   1244  1247
3   1748  1751
4   1791  1794
5   2306  2309
6   3560  3563
7   4217  4220
8   4927  4930
9   6504  6507
10  8668  8671
11  9827  9830
12 10333 10336
13 11446 11449
14 12779 12782
15 13619 13622
16 16604 16607
17 16659 16662
18 19200 19203
19 20181 20184
20 20228 20231

And we can use those locations to pull the matching substrings out of the flattened string:

 > sapply(1:nrow(bb), function(i){
    stri_sub(a_flat, bb[i,'start'], bb[i,'end'])
})
[1] "AACG" "AACG" "AACG" "AACG" "AACG" "AACG" "AACG" "AACG"
[9] "AACG" "AACG" "AACG" "AACG" "AACG" "AACG" "AACG" "AACG"
[17] "AACG" "AACG" "AACG" "AACG"

Hopefully this sheds a little light for you.

3 answers

Answer 1
3 2016-12-01 05:38:26

Answer 2
2 2016-12-01 17:46:02

Answer 3
1 2016-12-01 06:11:56
