Python filter() and sort() taking too long

Question

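I'm fairly new to Python and am using it for a script that opens a folder, filters for files only, and then sorts them in descending order by modification time. After that, the script starts at the first log file and searches for any line containing the word "failure", keeping a count of how many it finds in each file. It then writes this information to a separate file.

The problem I'm having is that this script takes 20-30 minutes to run. The folder contains 5k+ files, but the script does not have to iterate through all of them. It stores the first file it touched on the previous run in a separate file and stops processing as soon as it reaches that file again.

Where I have found the script to be taking too long is in the built-in filter() and sort() methods. Can anyone offer reasons why it is so slow, and perhaps a solution?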

os.chdir(path)
files = filter(os.path.isfile, os.listdir(prod_path))
files.sort(key = os.path.getmtime, reverse = True)

for file in files:
    if file == break_file:
        break
    f = open(path + file).readlines()
    count = 0 #set count of errors to 0
    dest.write('Upload: ' + file[:file.index(".")] + '\n') #filename, need to truncate off filetype
    for line in f: #for each line in the list of lines read from file
        if line.find('Failure') != -1 and line != f[0]:
            dest.write(line + '\n')
            count += 1
    dest.write('Number of Errors: ' + str(count) + '\n\n')

if dest.tell() == 0:
    dest.write('No files to process!')
dest.close()
update_file = open(last_run_file, 'w') #last_run_file stores break_file
update_file.write(str(files[0]))
print "done"

Answer 1

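Problems I noticed:

As @dm03514 mentioned, readlines() is a bad idea: it loads the whole file into memory, which can lead to heavy swapping. It is better to iterate over the file directly with for line in open(path + file):

Change the condition to if 'Failure' in line:. It is more Pythonic, and avoiding the str.find() call may be faster.

line != f[0] is, I suppose, a check for the first line, so it is better to skip that line once:

log_file = open(path + file)
# read first line
log_file.readline()
# read rest lines
for line in log_file:
    if 'Failure' in line:
        # handle the match as in the question
        dest.write(line)
        count += 1

Multithreading: Python has the GIL, but it only affects CPU-bound operations and is released during blocking I/O, so you could parse each file in a separate thread. See the threading documentation.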
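The points above can be combined into one small sketch. It is written for Python 3, whereas the question's code is Python 2, and the helper names count_failures and scan_logs, the max_workers default, and the errors="replace" decoding choice are all assumptions made for illustration, not part of the original answer. Whether threads actually help depends on how much of the run time is spent waiting on disk or network I/O.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def count_failures(file_path, needle="Failure"):
    """Return (file_path, matching_lines) for one log, skipping its first line."""
    matches = []
    # errors="replace" is an assumption: it keeps odd bytes from aborting the scan.
    with open(file_path, errors="replace") as log_file:
        next(log_file, None)  # skip the first line once; tolerate empty files
        for line in log_file:  # streams line by line instead of readlines()
            if needle in line:
                matches.append(line)
    return file_path, matches


def scan_logs(file_paths, max_workers=8):
    """Scan several logs in worker threads; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(count_failures, file_paths))
```

Because pool.map returns results in input order, the report can still be written newest file first, with len(matches) as the error count for each file.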

Answer 2

A few small speedups (there isn't much detail in your post, so this is the best I can do):

import itertools
import os

os.chdir(path)
fnames = [fname for fname in os.listdir(prod_path) if os.path.isfile(fname)]
fnames.sort(key=os.path.getmtime, reverse=True)
# lazily stop at break_file; tee lets us peek at the first name without consuming it
firstfile, fnames = itertools.tee(itertools.takewhile(lambda fname: fname != break_file, fnames))
# if nothing is newer than break_file, keep break_file as the marker
firstfile = next(firstfile, break_file)

with open('path/to/dest', 'w') as dest:
    errord = False
    for fname in fnames:
        errord = True  # at least one file was processed
        with open(os.path.join(path, fname)) as infile:
            numErrors = 0
            dest.write('Upload: %s\n' % fname.split('.')[0])
            infile.readline()  # skip the first line
            for line in infile:
                if "Failure" not in line:
                    continue
                dest.write(line)
                numErrors += 1
            dest.write('Number of Errors: %d\n\n' % numErrors)

    if not errord:
        dest.write('No files to process!')

with open(last_run_file, 'w') as update_file:
    update_file.write(firstfile)
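One further idea, not taken from either answer: os.path.isfile and os.path.getmtime each issue their own stat call per file, which can add up across 5k+ files on a slow or network filesystem. On Python 3.6 or later, os.scandir yields entries that cache this information, so the file check and the modification time can be read in a single pass. The function name newest_files and its stop_name parameter below are hypothetical, and whether this is the actual bottleneck in the question's setup is an assumption.

```python
import os


def newest_files(folder, stop_name=None):
    """Return file names in `folder`, newest first, stopping before `stop_name`."""
    entries = []
    with os.scandir(folder) as it:
        for entry in it:
            if entry.is_file():  # skips directories
                # entry.stat() is cached on the entry after the first call
                entries.append((entry.stat().st_mtime, entry.name))
    entries.sort(reverse=True)  # newest modification time first
    names = []
    for _mtime, name in entries:
        if name == stop_name:  # the file recorded on the previous run
            break
        names.append(name)
    return names
```

The first element of the returned list, if there is one, is the name to store as the new marker for the next run.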
