Count the number of a certain triplet in a file (DNA codon analysis)

This question is about DNA codon analysis. To put it simply, say I have a file like this:
atgaaaccaaag...
and I want to count the number of 'aaa' triplets in this file. Importantly, the triplets are read from the very beginning (that is, atg, aaa, cca, aag, ...), so the result in this example should be 1, not 2.
Is there a Python or shell script method to do this? Thanks!

First, read in the file:

with open("some.txt") as f:
    file_data = f.read()

Then split it into chunks of three characters:

codons = [file_data[i:i+3] for i in range(0,len(file_data),3)]

Then count them:

print(codons.count('aaa'))

Like so:

>>> my_codons = 'atgaaaccaaag'
>>> codons = [my_codons[i:i+3] for i in range(0,len(my_codons),3)]
>>> codons
['atg', 'aaa', 'cca', 'aag']
>>> codons.count('aaa')
1
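
The three steps above can be wrapped into one function. This is only a sketch: the names `count_codon` and `count_codon_in_file` are made up for illustration, and it assumes the file holds a bare sequence that may be wrapped across lines and may mix upper and lower case, so whitespace is removed and the text is lowercased before splitting.

```python
def count_codon(text, codon="aaa"):
    """Count in-frame occurrences of `codon`, with the frame starting at the first base."""
    # Assumption: whitespace and line breaks are not part of the sequence.
    seq = "".join(text.split()).lower()
    step = len(codon)
    # Split into consecutive, non-overlapping chunks of `step` characters.
    codons = [seq[i:i + step] for i in range(0, len(seq), step)]
    return codons.count(codon.lower())


def count_codon_in_file(path, codon="aaa"):
    """Read the whole file at `path` (hypothetical helper) and count `codon` in frame."""
    with open(path) as f:
        return count_codon(f.read(), codon)


print(count_codon("atgaaaccaaag"))  # prints 1
```

Note that a plain `'atgaaaccaaag'.count('aaa')` returns 2 here, because it also finds the out-of-frame match inside "ccaaag".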

The obvious solution is to split the string into 3-character pieces and then count the occurrences of "aaa":

>>> s = 'atgaaaccaaag'
>>> [s[i : i + 3] for i in range(0, len(s), 3)].count('aaa')
1

If the string is really long, this solution uses memory unnecessarily, because it builds the full list of substrings before counting. You can count the matches directly instead:

>>> s = 'atgaaaccaaag'
>>> sum(s[i : i + 3] == 'aaa' for i in range(0, len(s), 3))
1
>>> s = 'aaatttaaacaaagg'
>>> sum(s[i : i + 3] == 'aaa' for i in range(0, len(s), 3))
2

This uses a generator expression instead of creating a temporary list, so it is more memory efficient. It takes advantage of the fact that True == 1, so summing the comparison results counts the matches (for example, True + True == 2).
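
If the file itself is too large to read in one go, the same generator idea can be applied chunk by chunk. The following is a rough sketch, not a tuned implementation: `count_codon_stream` and its `chunk_codons` parameter are hypothetical names, and it assumes that whitespace and line breaks are not part of the sequence. Bases that do not fill a whole triplet at the end of a chunk are carried over so the reading frame is preserved.

```python
import io


def count_codon_stream(handle, codon="aaa", chunk_codons=4096):
    """Count in-frame occurrences of `codon` while reading `handle` in chunks."""
    size = len(codon)
    total = 0
    leftover = ""  # bases carried over so the reading frame stays aligned
    while True:
        chunk = handle.read(size * chunk_codons)
        if not chunk:
            break
        # Assumption: whitespace (including newlines) is not sequence data.
        data = leftover + "".join(chunk.split())
        usable = len(data) - len(data) % size
        total += sum(data[i:i + size] == codon for i in range(0, usable, size))
        leftover = data[usable:]
    return total


# Usage with an in-memory handle; a real file opened with open() works the same way.
print(count_codon_stream(io.StringIO("atgaaacc\naaag\n")))  # prints 1
```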

You could first break the string into triples, using something like:

def split_by_size(seq, length):
    # Slice seq into consecutive, non-overlapping pieces of the given length
    return [seq[i:i+length] for i in range(0, len(seq), length)]

tripleList = split_by_size('atgaaaccaaag', 3)

Then check each triple against "aaa" and count the matches:

print(sum(1 for x in tripleList if x == "aaa"))

Using a simple shell pipeline, assuming your FASTA file contains only one sequence. This prints a count for every in-frame triplet, so the line for 'aaa' gives the number you want:

grep -v ">"  < input.fa |
tr -d '\n' |
sed 's/\([ATGCatgcNn]\{3,3\}\)/\1#/g' |
tr "#" "\n" |
awk '(length($1)==3)' |
sort |
uniq -c
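
For comparison, roughly the same tally can be produced in Python with `collections.Counter`. This is a sketch under the same single-sequence assumption; `codon_counts` is an illustrative name, header lines are taken to be those starting with ">", and unlike the `sed` step it does not filter on the characters A, T, G, C and N.

```python
from collections import Counter


def codon_counts(fasta_text):
    """Tally every in-frame triplet in a single-sequence FASTA string."""
    # Drop header lines and join the remaining lines into one sequence.
    seq = "".join(
        line.strip()
        for line in fasta_text.splitlines()
        if not line.startswith(">")
    )
    # Ignore a trailing fragment shorter than three bases.
    usable = len(seq) - len(seq) % 3
    return Counter(seq[i:i + 3] for i in range(0, usable, 3))


counts = codon_counts(">seq1\natgaaacc\naaag\n")
print(counts["aaa"])  # prints 1
```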

