Python script takes a long time to run

Question

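I have written a script in Python to parse LDAP logs and get the number of searches/binds per user. I tested my code on sample files and on smaller files, up to 5-10 MB in size, where it runs quickly and finishes within a minute on my local PC. However, when I ran the script on an 18 MB file containing about 150,000 lines, it took around 5 minutes. I want to run this script on files of about 100 MB, with perhaps 5-6 files per run, which means the script has to parse almost 600-700 MB of data per run. I expect that to take a very long time, so I would appreciate your advice on whether the code below can be tuned for a better execution time.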

import os,re,datetime
from collections import defaultdict

d=defaultdict(list)
k=defaultdict(list)
start_time=datetime.datetime.now()

fh = open("C:\\Rohit\\ECD Utilization Script - Copy\\logdir\\access","r").read()
pat=re.compile(' BIND REQ .*conn=([\d]*).*dn=(.*")')

srchStr='\n'.join(re.findall(r' SEARCH REQ .*',fh))

bindlist=re.findall(pat,fh)
for entry in bindlist:
    d[entry[-1].split(",")[0]].append(entry[0])

for key in d:
    for con in d[key]:
        count = re.findall(con,srchStr)
        k[key].append((con,len(count)))

#
for key in k:
    print("Number of searches by ",key, " : ",sum([i[1] for i in k[key]]))

for key in d:
    print("No of bind  ",key," = ",len(d[key]))


end_time=datetime.datetime.now()
print("Total time taken - {}".format(end_time-start_time))

Answer 1

You are scanning the entire file several times on this line:

count = re.findall('SEARCH REQ.*'+conid,fh1)

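Avoid this; it is your main problem. Collect all the conids in a list first, then iterate over the file once more and check each line against that list. The conid lookup belongs in the inner loop, and the file scan should be pulled out of the outer loop. You will then be scanning the file only twice.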
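The two-scan idea described above can be sketched roughly as follows. This is only an illustration: the sample line shapes and the names `BIND_RE`, `SEARCH_RE` and `searches_per_user` are assumptions modelled on the regexes in the question, not taken from the real log. If a connection id binds more than once, this sketch simply keeps the last bind.

```python
import re
from collections import defaultdict

# Assumed line shapes, modelled on the regexes in the question:
#   ... BIND REQ conn=12 op=0 dn="uid=alice,ou=people"
#   ... SEARCH REQ conn=12 op=1 base="ou=people"
BIND_RE = re.compile(r' BIND REQ .*?conn=(\d+).*?dn="([^",]*)')
SEARCH_RE = re.compile(r' SEARCH REQ .*?conn=(\d+)')


def searches_per_user(lines):
    """Two scans over the lines: binds first, then searches."""
    lines = list(lines)  # so that a generator can also be scanned twice

    # Scan 1: remember which user bound on which connection id.
    user_by_conn = {}
    for line in lines:
        m = BIND_RE.search(line)
        if m:
            user_by_conn[m.group(1)] = m.group(2)

    # Scan 2: one dictionary lookup per SEARCH line, no rescan of the log.
    counts = defaultdict(int)
    for line in lines:
        m = SEARCH_RE.search(line)
        if m and m.group(1) in user_by_conn:
            counts[user_by_conn[m.group(1)]] += 1
    return dict(counts)
```

For a large file you would probably open the file twice (or call `seek(0)` between the scans) instead of holding all lines in a list; the list is used here only to keep the sketch short.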

Also, since this is pure Python, running it under PyPy should make it faster.

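You could do this even better with an FSM (finite-state machine), at the cost of some more RAM. This is only a hint; you would have to build the FSM yourself.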

Edit 1: This is the version of the script I wrote after looking at the log file. Please correct it if there are any mistakes:

#!/usr/bin/env python

import sys
import re


def parse(filepath):
        d = {}
        regex1 = re.compile(r'(.*)?BIND\sREQ(.*)uid=(\w+)')
        regex2 = re.compile(r'(.*)?SEARCH\sREQ(.*)uid=(\w+)')
        with open(filepath, 'r') as f:
                for l in f:
                        m = re.search(regex1, l)
                        if m:
                                # print (m.group(3))
                                uid = m.group(3)
                                if uid in d:
                                        d[uid]['bind_count'] += 1
                                else:
                                        d[uid] = {}
                                        d[uid]['bind_count'] = 1
                                        d[uid]['search_count'] = 0
                        m = re.search(regex2, l)
                        if m:
                                # print (m.group(3))
                                uid = m.group(3)
                                if uid in d:
                                        d[uid]['search_count'] += 1
                                else:
                                        d[uid] = {}
                                        d[uid]['search_count'] = 1
                                        d[uid]['bind_count'] = 0

        for k in d:
                print('user id = ' + k, 'Bind count = ' + str(d[k]['bind_count']), 'Search count = ' + str(d[k]['search_count']))


def process_args():
        # sys.argv is a list, so compare its length, not the list itself
        if len(sys.argv) < 2:
                print('Usage: parse_ldap_log.py log_filepath')
                sys.exit(1)



if __name__ == '__main__':
        process_args()
        parse(sys.argv[1])

Thank the gods, it was not complicated enough to need an FSM.

Answer 2

Use the itertools library instead of so many loops.
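The answer does not say which itertools functions it has in mind. One possible reading, sketched below, is to use `itertools.chain.from_iterable` to treat the 5-6 log files of a run as a single lazy stream of lines, so that only one counting loop is needed. The names `read_lines`, `count_request_types` and `REQ_RE` are made up for this illustration, and the ` BIND REQ ` / ` SEARCH REQ ` markers are taken from the regexes in the question.

```python
import itertools
import re
from collections import Counter

# Matches the request type marker used by the regexes in the question.
REQ_RE = re.compile(r' (BIND|SEARCH) REQ ')


def read_lines(path):
    """Yield the lines of one file lazily, one at a time."""
    with open(path, 'r') as f:
        yield from f


def count_request_types(paths):
    """Count BIND and SEARCH requests across several log files."""
    # chain.from_iterable joins the per-file generators into one stream,
    # so no file is ever loaded into memory as a whole.
    all_lines = itertools.chain.from_iterable(read_lines(p) for p in paths)
    matches = (REQ_RE.search(line) for line in all_lines)
    return Counter(m.group(1) for m in matches if m)
```

Note that this only removes loop nesting and memory pressure; the main speed-up still comes from not rescanning the log for every connection id.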

Answer 3

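Your script has quadratic complexity: for each line in the file, you read the data again to match the log entries. My suggestion is to read the file only once and count the occurrences of the entries you need (the ones matching " BIND REQ ").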
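As a rough sketch of that suggestion, the function below counts BIND requests per user while touching each line exactly once. The line shape in the comment and the names `BIND_DN_RE` and `count_binds` are assumptions based on the question's regex, not on the actual log format.

```python
import re
from collections import Counter

# Assumed line shape: ... BIND REQ conn=12 op=0 dn="uid=alice,ou=people"
# The capture group takes the first component of the dn.
BIND_DN_RE = re.compile(r' BIND REQ .*?dn="([^",]*)')


def count_binds(lines):
    """Count BIND requests per user in a single pass over the lines."""
    counts = Counter()
    for line in lines:  # each line is examined exactly once
        m = BIND_DN_RE.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts
```

Passing an open file object as `lines` keeps the work linear in the file size, because the file is streamed once and never rescanned.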

Answer 4

I was able to solve my problem with the code below.

import os,re,datetime
from collections import defaultdict



start_time=datetime.datetime.now()

bind_count=defaultdict(int)
search_conn=defaultdict(int)
bind_conn=defaultdict(str)
j=defaultdict(int)



fh = open("C:\\access","r")
total_searches=0
total_binds=0

for line in fh:
    reg1=re.search(r' BIND REQ .*conn=(\d+).*dn=(.*")', line)
    reg2=re.search(r' SEARCH REQ .*conn=(\d+).*', line)
    if reg1:
        total_binds+=1
        uid,con=reg1.group(2,1)
        bind_count[uid]=bind_count[uid]+1
        bind_conn[con]=uid

    if reg2:
        total_searches+=1
        skey=reg2.group(1)
        search_conn[skey] = search_conn[skey]+1


for conid in search_conn:
    if conid in bind_conn:
        new_key=bind_conn[conid]
        j[new_key]=j[new_key]+search_conn[conid]




for k,v in bind_count.items():
    print(k," = ",v)

print("*"*80)

for k,v in j.items():
    print(k,"-->",v)

fh.close()

del search_conn
del bind_conn

end_time=datetime.datetime.now()
print("Total time taken - {}".format(end_time-start_time))


4 answers

Answer 1
1 2016-12-15 07:40:13

Answer 2
0 2016-12-15 07:39:54

Answer 3
0 2016-12-15 07:40:07

Answer 4
0 accepted 2016-12-22 06:23:52
