Getting data from fastq by generator

Question

I have a task in a training course where I have to read big FASTQ files and filter out the 'good' reads. Each record contains a header, a DNA string, a + sign, and a line of symbols (the quality of each base in the DNA string). Example:

@hhhhhhhh
ATGCGTAGGGG
+
IIIIIIIIIII

I downsampled the data and got my code working, saving the results in a Python dictionary. But it turns out the original files are huge, so I rewrote the code as a generator. It works on the downsampled data, but I am wondering whether it is a good idea to pull out all the data and filter it in a dictionary. Does anybody here have a better idea?

I am asking because I am doing this by myself. I started learning Python a few months ago and I am still learning on my own. That is why I am asking for tips and help here, and sorry if I sometimes ask silly questions. Thanks in advance. Paulo

I got some ideas from a code snippet on Biostar:

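import sys
import gzip

filename = sys.argv[1]

def parsing_fastq_files(filename):
    # Open in text mode ("rt") so every line is already a str.
    with gzip.open(filename, "rt") as infile:
        # A FASTQ record is 4 lines: header, sequence, "+", qualities.
        for count_lines, line in enumerate(infile):
            if count_lines % 4 == 0:
                # Header line: drop the leading "@".
                ids = line[1:].strip()
                yield ids
            elif count_lines % 4 == 1:
                # Sequence line.
                reads = line.rstrip()
                yield reads

total_reads = parsing_fastq_files(filename)
print(next(total_reads))  # ID of the first read
print(next(total_reads))  # sequence of the first read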

I now need to figure out how to filter the data using something like 'if value.endswith('expression'):'. I could use a dict, for example, but that is where my doubt is, because of the number of keys and values.

Answer 1

Since this training forces you to code this manually, and you already have code that reads the FASTQ file as a generator, you can now apply whatever metric you have (Phred score, maybe?) to decide whether a read is good. You can append each "good" read to a new file, so you don't hold much in working memory even if almost all reads turn out to be good.
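As a rough sketch of that idea, the snippet below streams records one at a time and keeps only those whose mean Phred score reaches a threshold. The function names, the threshold of 20, and the Phred+33 quality encoding are all assumptions made for illustration; use whatever metric and encoding your training specifies.

```python
import gzip


def read_fastq(handle):
    """Yield (header, sequence, quality) tuples from an open text handle."""
    while True:
        header = handle.readline()
        if not header:
            return  # empty string means end of file
        sequence = handle.readline().rstrip()
        handle.readline()  # skip the "+" separator line
        quality = handle.readline().rstrip()
        yield header.rstrip(), sequence, quality


def mean_phred(quality, offset=33):
    """Mean Phred score of a quality string (Phred+33 encoding assumed)."""
    if not quality:
        return 0.0
    return sum(ord(char) - offset for char in quality) / len(quality)


def filter_reads(records, min_mean_quality=20):
    """Lazily yield only the records that pass the (assumed) quality cutoff."""
    for header, sequence, quality in records:
        if mean_phred(quality) >= min_mean_quality:
            yield header, sequence, quality


def filter_fastq_file(in_path, out_path, min_mean_quality=20):
    """Stream a gzipped FASTQ file and write the good reads to a new file."""
    kept = 0
    with gzip.open(in_path, "rt") as infile, open(out_path, "w") as outfile:
        for header, sequence, quality in filter_reads(read_fastq(infile), min_mean_quality):
            outfile.write(f"{header}\n{sequence}\n+\n{quality}\n")
            kept += 1
    return kept
```

Only one record is in memory at any time, so no dictionary is needed. A sequence-based test such as sequence.endswith(...) could replace or join the quality check inside filter_reads.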

Writing to a file is a slow operation, so you could wait until you have collected, say, 50000 good sequences and then write them to the file in one go.
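A minimal sketch of that batching, assuming the good reads arrive as (header, sequence, quality) tuples from some generator. The function name write_in_batches and the tuple layout are made up for this example. Python file objects already buffer writes internally, so the gain may be modest and is worth measuring on your own data.

```python
def write_in_batches(records, out_handle, batch_size=50000):
    """Collect formatted records in a list and write them batch_size at a time.

    records is any iterable of (header, sequence, quality) tuples;
    out_handle is an open text file. Returns the number of records written.
    """
    batch = []
    written = 0
    for header, sequence, quality in records:
        batch.append(f"{header}\n{sequence}\n+\n{quality}\n")
        if len(batch) >= batch_size:
            out_handle.write("".join(batch))  # one write call per full batch
            written += len(batch)
            batch.clear()
    if batch:
        # Flush whatever is left after the last full batch.
        out_handle.write("".join(batch))
        written += len(batch)
    return written
```

At most batch_size formatted records sit in memory at once, so this still scales to huge input files.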

Check out https://bioinformatics.stackexchange.com/ if you do a lot of bioinformatics programming.
