Python filter() and sort() taking too long

I am fairly new to Python and utilized it for a script to open a folder, filter down to only the files and then sort them in descending order based on modified time. After this the script starts at the first log file and searches for any line that contains the word 'failure', also keeping a count of how many it finds each time. It then writes this info to a separate file.

The issue I'm having is that this script is taking 20-30 mins to run. The folder contains 5k+ files, however it does not have to iterate through all of them. The script stores in a separate file the first file it touched the last time it ran, and stops processing once it hits that file again.

Where I am finding the script takes too long is in using the built-in filter() and sort() methods. Can anyone offer reasons as to why they are so slow, and perhaps offer a solution?

os.chdir(path)
files = filter(os.path.isfile, os.listdir(prod_path))
files.sort(key = os.path.getmtime, reverse = True)

for file in files:
    if file == break_file:
        break
    f = open(path + file).readlines()
    count = 0 #set count of errors to 0
    dest.write('Upload: ' + file[:file.index(".")] + '\n') #filename, need to truncate off filetype
    for line in f: #for each line in the the list of lines read from file
        if line.find('Failure') != -1 and line != f[0]:
            dest.write(line + '\n')
            count += 1
    dest.write('Number of Errors: ' + str(count) + '\n\n')

if dest.tell() == 0:
    dest.write('No files to process!')
dest.close()
update_file = open(last_run_file, 'w') #last_run_file stores break_file
update_file.write(str(files[0]))
print "done"    

Problems that I've noticed:

  1. As @dm03514 mentioned, readlines() is a bad idea: it reads the whole file into memory and could lead to heavy swapping. Better to iterate the file directly with for line in open(path + file):
  2. Change the condition to if 'Failure' in line:. It is more Pythonic, and dropping the str.find() call can be faster.
  3. line != f[0] is a check for the first line, I suppose, so it's better to just skip it once:

     log_file = open(path + file)
     log_file.readline()            # read (and discard) the first line
     for line in log_file:          # read the rest of the lines
         if 'Failure' in line:
             dest.write(line + '\n')
             count += 1
  4. Multithreading: Python has a GIL, but it only limits CPU-bound work; since this script is dominated by file I/O, you could parse each file in a separate thread (a sketch follows this list). See the threading documentation.
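
Here is a minimal sketch of point 4. It is not the answerer's code: count_failures() is a hypothetical helper, the '/var/log/uploads/' directory and 'report.txt' file names are made-up placeholders, max_workers=4 is arbitrary, and it relies on concurrent.futures.ThreadPoolExecutor (standard library in Python 3; available for Python 2.7 via the 'futures' backport). Because the work is dominated by file I/O, threads can overlap the waiting despite the GIL, and pool.map() yields results in the same (mtime-sorted) order the names were submitted, so the report stays ordered.

import os
from concurrent.futures import ThreadPoolExecutor

path = '/var/log/uploads/'   # hypothetical log directory (prod_path in the question)

def count_failures(fname):
    """Parse one log file; return its name, the matching lines and their count."""
    matches = []
    with open(os.path.join(path, fname)) as log_file:
        log_file.readline()                 # skip the first line (point 3)
        for line in log_file:
            if 'Failure' in line:           # point 2
                matches.append(line)
    return fname, matches, len(matches)

fnames = [f for f in os.listdir(path) if os.path.isfile(os.path.join(path, f))]
fnames.sort(key=lambda f: os.path.getmtime(os.path.join(path, f)), reverse=True)

with open('report.txt', 'w') as dest, ThreadPoolExecutor(max_workers=4) as pool:
    # pool.map keeps results in submission order, so the report stays mtime-sorted
    for fname, matches, count in pool.map(count_failures, fnames):
        dest.write('Upload: %s\n' % fname.rsplit('.')[0])
        dest.writelines(matches)
        dest.write('Number of Errors: %d\n\n' % count)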

Some minor speedups (there isn't too much detail in your post, so this is the best I can do):

import itertools
import os

os.chdir(path)
fnames = [fname for fname in os.listdir(prod_path) if os.path.isfile(fname)]
fnames.sort(key=os.path.getmtime, reverse=True)
# takewhile stops at break_file; tee lets us peek at the newest file name
# while still iterating over the rest of the names below
firstfile, fnames = itertools.tee(itertools.takewhile(lambda fname: fname != break_file, fnames))
firstfile = next(firstfile, break_file)  # fall back to break_file if there is nothing new

with open('path/to/dest', 'w') as dest:
    processed = False        # becomes True once at least one new file is handled
    for fname in fnames:
        processed = True
        with open(os.path.join(path, fname)) as infile:
            numErrors = 0
            dest.write('Upload: %s\n' %(fname.rsplit('.')[0]))
            infile.readline()   # skip the first line, as in the original script
            for line in infile:
                if "Failure" not in line: continue
                dest.write(line)
                numErrors += 1
            dest.write('Number of Errors: %d\n\n' %numErrors)

    if not processed:
        dest.write('No files to process!')

with open(last_run_file, 'w') as update_file:
    update_file.write(firstfile)
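
On the original question of why the filter() and sort() steps themselves feel slow: os.path.isfile() and the os.path.getmtime sort key each trigger a stat() system call, so listing 5k+ files costs thousands of system calls before a single log line is read, which is especially painful on a network share. The sketch below is not from either answer above; it uses os.scandir() (Python 3.5+, or the third-party scandir backport for older interpreters), whose DirEntry objects cache their stat() results, so the listing-and-sorting step typically needs fewer system calls. The helper name files_newest_first is made up for illustration.

import os

def files_newest_first(dirpath):
    """Return the names of regular files in dirpath, newest modification time first."""
    entries = [e for e in os.scandir(dirpath) if e.is_file()]
    # DirEntry.stat() is fetched at most once per entry and then cached
    entries.sort(key=lambda e: e.stat().st_mtime, reverse=True)
    return [e.name for e in entries]

fnames = files_newest_first(prod_path)   # drop-in for the filter()/sort() pair above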
