
Parsing a huge structured file in Python 2.7

I am a newbie in the Python world and in bioinformatics. I am processing a nearly 50GB structured file and writing selected records out of it, so I would appreciate some tips.

The file looks like this (it's the FASTQ format):

@Machinename:~:Team1:atcatg   # line 1: header
atatgacatgacatgaca            # line 2: sequence
+                             # line 3: separator
asldjfwe!@#$#%$               # line 4: quality scores

These four lines repeat in that order throughout the file; each group of four forms one record. I also have nearly 30 candidate DNA sequences, e.g. atgcat, tttagc.

What I am doing: for each candidate DNA sequence, I scan the huge file to check whether the candidate is similar to a record's team DNA sequence, where "similar" means at most one mismatch is allowed (e.g. taaaaa matches aaaaaa). Whenever a record matches, I store it in a dictionary so I can write it out later: the key is the candidate DNA sequence, and the value is a list holding the record's 4 lines in their original order.
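A similarity test of this kind is just a Hamming-distance check. Here is an illustrative sketch (the helper name `within_one_mismatch` is made up, not the poster's actual function):

```python
def within_one_mismatch(seq1, seq2):
    """Return True if the two sequences have equal length and
    differ in at most one position (Hamming distance <= 1)."""
    if len(seq1) != len(seq2):
        return False
    mismatches = 0
    for a, b in zip(seq1, seq2):
        if a != b:
            mismatches += 1
            if mismatches > 1:
                return False  # bail out early on the second mismatch
    return True
```

Bailing out on the second mismatch matters here, because this test runs once per candidate for every record in a 50GB file.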

So what I have done is:

def myfunction(str1, str2):
    # returns True if the two sequences are similar (at most one mismatch)
    ...

f = open('hugefile')
diction = {}
mylist = ['candidate dna sequence1', 'dna2', 'dna3', 'dna4']  # ~30 candidates

while True:
    line = f.readline()
    if not line:
        break
    if 'Machinename' in line:              # header line of a record
        teamseq = line.split(':')[-1].strip()
        record = [line, f.readline(), f.readline(), f.readline()]
        for candidate in mylist:
            if myfunction(candidate, teamseq):
                if candidate not in diction:
                    diction[candidate] = []
                # some team sequences are repeated, so always append
                diction[candidate].extend(record)
f.close()

wf = open('hugefile.out', 'w')
for candidate in mylist:                   # dna1, dna2, dna3, ...
    wf.writelines(diction.get(candidate, []))
wf.close()

My comparison function doesn't use any global variables (I think I am happy with it), whereas the dictionary is a global variable that holds all the matched data and creates lots of list instances. The code is simple but very slow, and a big pain in the butt for both CPU and memory, even though I run it with PyPy.

So, any tips for writing the records out while preserving the original line order?

I suggest opening input and output files simultaneously and writing to the output as you step through the input. As it is now, you are reading 50GB into memory and then writing it out. That is both slow and unnecessary.

IN PSEUDOCODE:

with open(hugefile) as fin, open(hugefile + ".out", 'w') as fout:
   for line in fin:
      if "machine name" in line:
          # read the next 3 lines from fin to complete the 4-line record
          # process that record
          # write the record to fout
          # the input record is no longer needed -- allow it to be garbage collected

As outlined, each 4-line record is written out as soon as it is encountered and then disposed of. If you need to refer to previously seen keys, keep only the minimum necessary in a set() to cut down the total size of the in-memory data.
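Filled out as runnable code, the streaming approach might look like this (the names `is_similar`, `filter_fastq`, and `candidates` are placeholders, not from the original post):

```python
def is_similar(a, b):
    # placeholder one-mismatch test (Hamming distance <= 1)
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) <= 1

def filter_fastq(in_path, out_path, candidates):
    """Stream a FASTQ file record by record, writing out every 4-line
    record whose header sequence is within one mismatch of any
    candidate. Only one record is held in memory at a time."""
    with open(in_path) as fin, open(out_path, 'w') as fout:
        while True:
            header = fin.readline()
            if not header:
                break  # end of file
            # a record is the header plus the next 3 lines
            record = [header, fin.readline(), fin.readline(), fin.readline()]
            team_seq = header.rstrip().split(':')[-1]
            if any(is_similar(cand, team_seq) for cand in candidates):
                fout.writelines(record)
```

Because records are written in the order they are read, the output preserves the input's line order automatically; only candidate-grouped output (as in the question) would need buffering.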
