Counting 20-mers in a FASTA file with Python

I have a regular FASTA file, 'single_mapped.fa', whose reads are 120 nt long.

I also have a tab-separated file, '20frequent_20mers.txt', that lists 10,000 20-mers with a count for each one, like this:

AAAAAGTATAGGAGATAGAA    35
AAAAATAGGAGGACTATTCA    26
AAAAATAGGAGGACTATTTA    24
AAAAATAGGAGGCCTATTCA    62

I want to go through single_mapped.fa and, for each read, sum the counts of all 20-mers from 20frequent_20mers.txt that occur in it. For example, for a read containing:

AAAAAGTATAGGAGATAGAA AAAAATAGGAGGACTATTCA

I want to get 61 (35 + 26).

My code:

import csv
from Bio import SeqIO

file2 = open('20frequent_20mers.txt','r')
kmer_list = csv.reader(file2, delimiter='\t')

for seq_record in SeqIO.parse("single_mapped.fa", "fasta"):
    print(seq_record.id)
    score_fre = 0
    sequence_string = str(seq_record.seq)
    for i in range(0,101):
        seq = sequence_string[i:i+20]
        for row in kmer_list:
            if row[0] == seq:
                score_fre = score_fre + int(row[1])
    print(score_fre)

Each loop works when I run it separately, but the combined code above does not give the expected result. Could anyone tell me where the mistake is, or whether there is a smarter and more efficient way to do this? Thanks in advance!

With the code as you have it, you would need to re-read your k-mer file from the start for every sequence and every value of i. This would be very slow and should be avoided. Because you never move the file pointer back to the start, the inner loop only finds rows the first time it runs; after that the reader is exhausted and yields nothing.

The file pointer could be reset by adding the following just before the for row in kmer_list: line:

file2.seek(0)
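
As a minimal sketch of why the inner loop matches only once, the snippet below replays the situation with an in-memory file. The io.StringIO object is only a stand-in for '20frequent_20mers.txt' (an assumption made so the example runs without any files); the two rows are taken from the question.

```python
import csv
import io

# In-memory stand-in for '20frequent_20mers.txt' (assumed to be tab-separated).
file2 = io.StringIO("AAAAAGTATAGGAGATAGAA\t35\nAAAAATAGGAGGACTATTCA\t26\n")
kmer_list = csv.reader(file2, delimiter='\t')

first_pass = [row[0] for row in kmer_list]    # consumes both rows
second_pass = [row[0] for row in kmer_list]   # reader is exhausted: no rows

file2.seek(0)                                 # rewind the underlying file
third_pass = [row[0] for row in kmer_list]    # rows are available again

print(len(first_pass), len(second_pass), len(third_pass))  # 2 0 2
```

Rewinding makes the loop correct, but it still scans the whole table once per window, which is why the dictionary approach is preferable.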

A much better approach would be to first load all of your k-mer entries into a dictionary that maps each k-mer to its count. Each window can then be looked up quickly, without touching the file again:

import csv
from Bio import SeqIO

# Map each 20-mer to its count so that lookups are fast
kmers = {}

with open('20frequent_20mers.txt') as f_kmers:
    for kmer, count in csv.reader(f_kmers, delimiter='\t'):
        kmers[kmer] = int(count)

for seq_record in SeqIO.parse("single_mapped.fa", "fasta"):
    print(seq_record.id)
    score_fre = 0
    sequence_string = str(seq_record.seq)

    # A 120-nt read has 120 - 20 + 1 = 101 overlapping 20-mers
    for i in range(0, 101):
        seq = sequence_string[i:i+20]
        score_fre += kmers.get(seq, 0)

    print(score_fre)

If seq is not found in the dictionary, the default value of 0 is returned.
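
As a small worked check of the dictionary lookup, here is a sketch that wraps the window loop in a helper and runs it on the example read from the question. The function name score_read and its k parameter are hypothetical, introduced only for illustration; the window count is derived from the read length rather than fixed at 101, so shorter or longer reads would also work.

```python
def score_read(sequence, kmers, k=20):
    """Sum the table counts of every overlapping k-mer in `sequence`."""
    total = 0
    # A read of length n has n - k + 1 windows (101 for n=120, k=20);
    # range() is empty if the read is shorter than k.
    for i in range(len(sequence) - k + 1):
        total += kmers.get(sequence[i:i + k], 0)  # 0 if the k-mer is not in the table
    return total


# The four rows shown in the question
kmers = {
    "AAAAAGTATAGGAGATAGAA": 35,
    "AAAAATAGGAGGACTATTCA": 26,
    "AAAAATAGGAGGACTATTTA": 24,
    "AAAAATAGGAGGCCTATTCA": 62,
}

read = "AAAAAGTATAGGAGATAGAA" + "AAAAATAGGAGGACTATTCA"
print(score_read(read, kmers))  # 61
```

Only the windows starting at positions 0 and 20 are in the table, so the result is 35 + 26 = 61.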

Here is an alternative implementation (not necessarily better or faster). It uses @MartinEvans' dictionary, but generates the k-mers to test with re.findall() and uses map and sum instead of an explicit inner loop:

from Bio import SeqIO
from re import findall
from itertools import repeat

kmers = {}

with open('20frequent_20mers.txt') as f_kmers:
    for line in f_kmers:
        kmer, count = line.strip().split('\t')
        kmers[kmer] = int(count)

for seq_record in SeqIO.parse("single_mapped.fa", "fasta"):
    print(seq_record.id)

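    # A lookahead matches without consuming characters, so findall() returns
    # overlapping 20-mers; repeat(0) supplies the default value for kmers.get.
    # Windows containing anything other than A, C, T or G are skipped.
    score_fre = sum(map(kmers.get, findall(r'(?=([ACTG]{20}))', str(seq_record.seq)), repeat(0)))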

    print(score_fre)
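
To illustrate the lookahead trick on its own, the sketch below uses 3-mers on a toy sequence instead of 20-mers so the results are easy to read. The sequence and the counts dictionary are made up for illustration.

```python
from re import findall
from itertools import repeat

seq = "ACGTAC"

# Without a lookahead, each match consumes its characters, so windows do not overlap.
plain = findall(r'[ACGT]{3}', seq)
print(plain)        # ['ACG', 'TAC']

# With a lookahead, the match itself is empty and the group captures every window.
overlapping = findall(r'(?=([ACGT]{3}))', seq)
print(overlapping)  # ['ACG', 'CGT', 'GTA', 'TAC']

# Hypothetical counts; map() calls counts.get(kmer, 0) for each window.
counts = {"CGT": 2, "TAC": 5}
score = sum(map(counts.get, overlapping, repeat(0)))
print(score)        # 7
```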
