
Python filter() and sort() taking too long

I am fairly new to Python and used it to write a script that opens a folder, filters it down to regular files only, and then sorts those files in descending order by modified time. The script then starts at the first (newest) log file and searches every line for the word 'failure', keeping a count of how many it finds in each file. It then writes this information to a separate file.

The issue I'm having is that this script takes 20-30 minutes to run. The folder contains 5k+ files, but it does not have to iterate through all of them: the script records, in a separate file, the first file it touched the last time it ran, and stops processing once it reaches that file again.

Where I'm finding the script takes too long is the built-in filter() and sort() calls. Can anyone offer reasons why they are so slow, and perhaps a solution?

os.chdir(path)
files = filter(os.path.isfile, os.listdir(prod_path))
files.sort(key = os.path.getmtime, reverse = True)

for file in files:
    if file == break_file:
        break
    f = open(path + file).readlines()
    count = 0 #set count of errors to 0
    dest.write('Upload: ' + file[:file.index(".")] + '\n') #filename, need to truncate off filetype
    for line in f: #for each line in the list of lines read from file
        if line.find('Failure') != -1 and line != f[0]:
            dest.write(line + '\n')
            count += 1
    dest.write('Number of Errors: ' + str(count) + '\n\n')

if dest.tell() == 0:
    dest.write('No files to process!')
dest.close()
update_file = open(last_run_file, 'w') #last_run_file stores break_file
update_file.write(str(files[0]))
print "done"    

Problems that I've noticed:

  1. As @dm03514 mentioned, readlines() is a bad idea: it loads the whole file into memory at once, which can lead to heavy swapping. Better to iterate over the file directly: for line in open(path + file):
  2. Change the condition to if 'Failure' in line: . It is more Pythonic, and skipping the str.find() call can be slightly faster.
  3. line != f[0] compares every line against the first line, I suppose to skip a header, so it's cheaper to read that first line once and skip it:

     log_file = open(path + file)
     # read the first line and discard it
     log_file.readline()
     # read the rest of the lines
     for line in log_file:
         if 'Failure' in line:
             dest.write(line + '\n')  # same handling as in the question
             count += 1
  4. Multithreading: Python has the GIL, but it only constrains CPU-bound work, and this script is mostly I/O-bound, so you could parse each file in a separate thread. See the threading documentation; a sketch follows this list.
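
A minimal sketch of point 4, assuming Python 3 (or the futures backport) and using concurrent.futures.ThreadPoolExecutor rather than raw threading.Thread objects. parse_log is a hypothetical helper; path, dest and the sorted, already-truncated files list are taken from the question.

import concurrent.futures
import os

def parse_log(fname):
    # Collect the 'Failure' lines of one log file, skipping the first line.
    failures = []
    with open(os.path.join(path, fname)) as log_file:
        log_file.readline()  # skip the first line
        for line in log_file:
            if 'Failure' in line:
                failures.append(line)
    return fname, failures

# The GIL is released while a thread waits on file I/O, so a small pool can
# overlap reads of several log files. pool.map() yields results in the
# original (newest-first) order.
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
    for fname, failures in pool.map(parse_log, files):
        dest.write('Upload: ' + fname[:fname.index(".")] + '\n')
        for line in failures:
            dest.write(line)
        dest.write('Number of Errors: ' + str(len(failures)) + '\n\n')

All writes to dest happen in the main thread, so no locking is needed.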

Some minor speedups (there isn't too much detail in your post, so this is the best I can do):

import itertools
import os

os.chdir(path)
fnames = [fname for fname in os.listdir(prod_path) if os.path.isfile(fname)]
fnames.sort(key=os.path.getmtime, reverse=True)
# Drop everything at and after break_file, keeping two copies of the result:
# one to grab the newest file name, one to iterate over. This assumes there
# is at least one file newer than break_file.
firstfile, fnames = itertools.tee(itertools.takewhile(lambda fname: fname != break_file, fnames))
firstfile = next(firstfile)

with open('path/to/dest', 'w') as dest:
    errord = False
    for fname in fnames:
        errord = True  # at least one file was processed
        with open(os.path.join(path, fname)) as infile:
            numErrors = 0
            dest.write('Upload: %s\n' % (fname.rsplit('.')[0]))
            infile.readline()  # skip the first (header) line
            for line in infile:
                if "Failure" not in line: continue
                dest.write(line)
                numErrors += 1
            dest.write('Number of Errors: %d\n\n' % numErrors)

    if not errord:
        dest.write('No files to process!')

with open(last_run_file, 'w') as update_file:
    update_file.write(firstfile)
