
Python hangs after finishing a script

I have encountered some weird Python behavior that I've never seen before. I am running the following code:

from __future__ import print_function, division
import itertools
import sys

R1_file = sys.argv[1]
R2_file = sys.argv[2]
out_stats = sys.argv[3]

def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return itertools.izip_longest(fillvalue=fillvalue, *args)

print('### Started preparing iterators')
# iterate over the read files and collect a vector of locations per sequence
fid1 = open(R1_file)
fid2 = open(R2_file)
gr1 = grouper(fid1,4)
gr2 = grouper(fid2,4)
together = itertools.izip(gr1,gr2)

seq_hash = {}
seq_ind = 0
print('### Started reading fastq')
for blocks in together:
    seq_ind += 1
    if seq_ind%1000000 == 0:
        print('Read in',seq_ind,'reads')
    s1 = blocks[0][1]
    s2 = blocks[1][1]
    conc_seq = s1.strip()+s2.strip()
    if conc_seq in seq_hash:
        seq_hash[conc_seq].append(seq_ind)
    else:
        seq_hash[conc_seq] = [seq_ind]
fid1.close()
fid2.close()

# print results to file
print('### Started writing results to file')
with open(out_stats,'w') as fo:
    for seq,locations_vec in seq_hash.iteritems():
        n = len(locations_vec)
        if n > 1:
            print(seq,n,':'.join(str(l) for l in locations_vec),sep='\t',file=fo)
    print('done writing to file')
print('FINISHED')

This script runs on two FASTQ files, which have a specific format, looks for duplicate data and produces some statistics. Now, the strange thing is that after the script is done running, that is, all the required stats are printed to the output file and 'FINISHED' is printed to STDOUT, the script just hangs, doing seemingly nothing! The duration of the lag depends on the size of the input files: when I give it 100M inputs, it hangs for a few seconds; with 500M files, it hangs for about 10 minutes; when I run it on my full data (~130G) it practically never ends (I ran it overnight and it didn't finish). Again, everything that needs to be written to the output file and to STDOUT is indeed written.

During the lag, CPU usage is high and the memory required for holding the data is still occupied. I tried to do some tracing using pdb, and it looks like the script keeps running the loop starting with for blocks in together: again and again, after 'FINISHED' has already been printed (although I might be interpreting the pdb output wrong).

Currently I just terminate the script whenever it reaches its lag phase, and I can then work with the output with no problem, but this is still quite annoying. I am running on Ubuntu, using Python 2.7.
Any ideas?

As @gdlmx stated, it is probably Python cleaning up after closing the files. I had a similar problem with huge datasets stored as CSV (1e7 floating-point values per column): huge lags and very long computing times were to be expected.

The only way I found to avoid this was to use binary formats and load them into Python via numpy. You would then need a specification of those binary files. Another option is to write a parser for the FASTQ files.
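For example, here is a minimal sketch of the binary-format approach with numpy; the file names below are placeholders, not from the original post:

import numpy as np

# One-time conversion: parse the large text/CSV data and store it in
# numpy's binary .npy format, which is much faster to reload later.
data = np.loadtxt('huge_data.csv', delimiter=',')   # placeholder file name
np.save('huge_data.npy', data)

# Later runs: load (or memory-map) the binary file instead of re-parsing the text.
data = np.load('huge_data.npy', mmap_mode='r')
print(data.shape)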

Also, in case you didn't already know: Biopython offers modules for parsing file formats commonly found in bioinformatics, FASTQ among them.
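As a rough sketch (the file names are placeholders), pairing the two FASTQ files with Bio.SeqIO instead of the hand-rolled grouper could look like this on Python 2:

from Bio import SeqIO
import itertools

# Lazily parse both FASTQ files; each record exposes its sequence as rec.seq.
reads1 = SeqIO.parse('R1.fastq', 'fastq')   # placeholder file names
reads2 = SeqIO.parse('R2.fastq', 'fastq')

for seq_ind, (rec1, rec2) in enumerate(itertools.izip(reads1, reads2), start=1):
    # use zip(...) instead of itertools.izip(...) on Python 3
    conc_seq = str(rec1.seq) + str(rec2.seq)
    # ... same duplicate-tracking logic as in the question ...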
