简体   繁体   中英

Getting data from fastq by generator

I have a task in a training that i have to read and filter the 'good' reads of big fastq files. It contains a header, a dna string, + sign and some symbols(qualities of each dna string). Ex:

@hhhhhhhh
ATGCGTAGGGG
+
IIIIIIIIIIIII

I down sampled, got the code working, saving in a python dictionary. But turns out the original files are huge and I rewrite the code to give a generator. It did work for the down-sampled sample. But I was wondering if its a good idea to get out all the data and filtering in a dictionary. Does anybody here a better idea? I am asking because I am doing it by myself. I start learning python for some months and I still learning, but I doing alone. Because this I asking for tips and help here and sorry if some times i ask silly questions. thanks in advance. Paulo I got some ideas from a code in Biostar:

import sys
import gzip

filename = sys.argv[1]

def parsing_fastq_files(filename):
    with gzip.open(filename, "rb") as infile:
        count_lines = 0
            for line in infile:
            line = line.decode()
            if count_lines % 4 == 0:
                ids = line[1:].strip()
                yield ids
            if count_lines == 1:
                reads = line.rstrip()
                yield reads
        count_lines += 1

total_reads = parsing_fastq_files(filename)
print(next(total_reads))
print(next(total_reads))

I now need to figure out to get the data filtered by using 'if value.endswith('expression'):' but if I use a dict for example, but thats my doubt because the amount of keys and vals.

Since this training forces you to code this manually, and you have code that reads the fastQ as a generator, you can now use whatever metric (by phredscore maybe?) you have for determining the quality of the read. You can append each "good" read to a new file so you don't have much stuff in your working memory if almost all reads turn out to be good.

Writing to file is a slow operation, so you could wait until you have, say, 50000 good sequences and then write them to file.

Check out https://bioinformatics.stackexchange.com/ if you do a lot of bioinformatics programming.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM