count the number of a certain triplet in a file (DNA codon analysis)

Question

This question is about DNA codon analysis. To put it simply, say I have a file like this:
atgaaaccaaag...
and I want to count the number of 'aaa' triplets in this file. Importantly, the triplets start from the very beginning (i.e. atg, aaa, cca, aag, ...), so the result in this example should be 1 rather than 2.
Is there a Python or shell script method to do this? Thanks!

Answer 1

First, read in the file:

with open("some.txt") as f:
    file_data = f.read()

Then split it into chunks of three:

codons = [file_data[i:i+3] for i in range(0,len(file_data),3)]

Then count them:

print(codons.count('aaa'))

Like so:

>>> my_codons = 'atgaaaccaaag'
>>> codons = [my_codons[i:i+3] for i in range(0,len(my_codons),3)]
>>> codons
['atg', 'aaa', 'cca', 'aag']
>>> codons.count('aaa')
1
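
One caveat with reading the whole file: `f.read()` also returns the newline characters, so a sequence wrapped over several lines would shift the reading frame. Below is a minimal sketch that strips whitespace first. It assumes the file holds only sequence letters and whitespace, folds case as an extra assumption, and uses a helper name (`count_codon`) made up for illustration:

```python
def count_codon(text, codon):
    """Count in-frame occurrences of codon in text, ignoring whitespace."""
    # Drop newlines and spaces so line wrapping does not shift the frame.
    # Lower-casing is an assumption; remove it if case matters in your data.
    seq = "".join(text.split()).lower()
    codon = codon.lower()
    step = len(codon)
    # Walk the sequence in non-overlapping steps starting at position 0.
    return sum(seq[i:i + step] == codon for i in range(0, len(seq), step))


print(count_codon("atgaaacc\naaag\n", "aaa"))  # 1
```

With a real file, the string returned by `f.read()` can be passed in as `text`.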

Answer 2

The obvious solution is to split the string into 3-character pieces and then count the number of occurrences of "aaa":

>>> s = 'atgaaaccaaag'
>>> [s[i : i + 3] for i in range(0, len(s), 3)].count('aaa')
1

If the string is really long, this solution wastes memory by building the whole list of substrings. An alternative:

>>> s = 'atgaaaccaaag'
>>> sum(s[i : i + 3] == 'aaa' for i in range(0, len(s), 3))
1
>>> s = 'aaatttaaacaaagg'
>>> sum(s[i : i + 3] == 'aaa' for i in range(0, len(s), 3))
2

This uses a generator expression instead of creating a temporary list, so it is more memory efficient. It takes advantage of the fact that True == 1, i.e. True + True == 2.
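
To see why the reading frame matters, it may help to compare this with the built-in `str.count`, which matches the triplet at any offset. A small sketch using the example string from the question (the variable names are just for illustration):

```python
s = 'atgaaaccaaag'

# str.count ignores the reading frame: it matches 'aaa' at offsets 3 and 8.
any_offset = s.count('aaa')

# Only offsets 0, 3, 6, 9 are codon boundaries; each comparison yields a
# bool, and sum() adds them up as 0 or 1.
in_frame = sum(s[i:i + 3] == 'aaa' for i in range(0, len(s), 3))

print(any_offset, in_frame)  # 2 1
```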

Answer 3

You could first break the string into triples, using something like:

def split_by_size(seq, length):
    # Slice seq into consecutive chunks of the given length.
    return [seq[i:i+length] for i in range(0, len(seq), length)]

tripleList = split_by_size('atgaaaccaaag', 3)

Then check for "aaa" and count the matches:

# Add 1 for every triple equal to "aaa".
print(sum(1 for x in tripleList if x == "aaa"))

Answer 4

Using a simple shell pipeline, assuming your FASTA file contains only one sequence:

grep -v ">"  < input.fa |
tr -d '\n' |
sed 's/\([ATGCatgcNn]\{3,3\}\)/\1#/g' |
tr "#" "\n" |
awk '(length($1)==3)' |
sort |
uniq -c
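
For comparison, roughly the same tally can be sketched in Python with `collections.Counter`. Like the pipeline above, this assumes a single-sequence FASTA file. The function name `codon_counts` and the sample record are made up for illustration, and unlike the `sed` step it does not filter on the `ATGCN` alphabet:

```python
from collections import Counter


def codon_counts(lines):
    """Tally every complete in-frame triplet of a single-sequence FASTA."""
    # Skip header lines (starting with '>') and join the rest into one string.
    seq = "".join(line.strip() for line in lines if not line.startswith(">"))
    # Keep only full triplets, like the awk length($1)==3 filter.
    codons = (seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    return Counter(codons)


# Hypothetical in-memory record; with a real file, pass the open file object.
example = [">seq1\n", "atgaaacc\n", "aaagt\n"]
print(codon_counts(example)["aaa"])  # 1
```

The returned `Counter` holds the count for every triplet, so a single codon such as 'aaa' is just one lookup.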

4 answers

solution1
7 ACCEPTED 2012-09-26 20:55:31

solution2
2 2012-09-26 20:58:16

solution3
1 2012-09-26 20:58:32

solution4
0 2012-09-26 21:56:02
