
Python script taking a long time to run

I am writing a script in Python to parse LDAP logs and then get the number of searches/binds by each user. I was testing my code on sample files; for smaller files up to 5-10 MB it runs quickly and completes within a minute on my local PC. However, when I ran the script on an 18 MB file with around 150,000 lines in it, it took around 5 minutes. I want to run this script on files of 100 MB, and maybe 5-6 files per run, which means the script has to parse almost 600-700 MB of data in each run. I suppose that would take a long time, so I would like some advice on whether my code below can be tuned for better performance in terms of execution time.

import re
import datetime
from collections import defaultdict

d = defaultdict(list)   # user -> list of conn ids from BIND lines
k = defaultdict(list)   # user -> list of (conn id, search count)
start_time = datetime.datetime.now()

# Read the whole log into memory at once.
fh = open("C:\\Rohit\\ECD Utilization Script - Copy\\logdir\\access", "r").read()
pat = re.compile(r' BIND REQ .*conn=([\d]*).*dn=(.*")')

# Collect all SEARCH REQ lines into one big string.
srchStr = '\n'.join(re.findall(r' SEARCH REQ .*', fh))

bindlist = re.findall(pat, fh)
for entry in bindlist:
    d[entry[-1].split(",")[0]].append(entry[0])

# For every conn id, re-scan the collected SEARCH lines -- this is the slow part.
for key in d:
    for con in d[key]:
        count = re.findall(con, srchStr)
        k[key].append((con, len(count)))

for key in k:
    print("Number of searches by ", key, " : ", sum([i[1] for i in k[key]]))

for key in d:
    print("No of binds ", key, " = ", len(d[key]))

end_time = datetime.datetime.now()
print("Total time taken - {}".format(end_time - start_time))

You are doing several scans of the entire file on this line:

count = re.findall('SEARCH REQ.*'+conid,fh1)

Avoid this; it is your major problem. Collect all the conn ids into a list first, then iterate over the file one more time, checking the conn ids in the inner loop instead of re-scanning the whole file inside the outer loop. That way you make only two scans of the file.
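A minimal sketch of that two-scan idea. The sample lines and the exact regexes are hypothetical (the real log format may differ), but the structure is: one scan to map conn ids to users, a second scan to count searches per conn id.

```python
import re
from collections import Counter

# Hypothetical sample lines standing in for the real access log.
log_lines = [
    ' BIND REQ conn=101 dn="uid=alice,ou=people"',
    ' SEARCH REQ conn=101 base="ou=people"',
    ' SEARCH REQ conn=101 base="ou=groups"',
    ' BIND REQ conn=202 dn="uid=bob,ou=people"',
    ' SEARCH REQ conn=202 base="ou=people"',
]

bind_re = re.compile(r' BIND REQ .*conn=(\d+).*dn="uid=([^,"]+)')
search_re = re.compile(r' SEARCH REQ .*conn=(\d+)')

# Scan 1: map connection id -> user who bound on it.
conn_to_user = {}
for line in log_lines:
    m = bind_re.search(line)
    if m:
        conn_to_user[m.group(1)] = m.group(2)

# Scan 2: count SEARCH REQ lines per connection, folded into per-user totals.
searches_per_user = Counter()
for line in log_lines:
    m = search_re.search(line)
    if m and m.group(1) in conn_to_user:
        searches_per_user[conn_to_user[m.group(1)]] += 1

print(dict(searches_per_user))  # {'alice': 2, 'bob': 1}
```

Each scan is linear in the file size, so the total work is O(n) rather than O(n * number of conn ids).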

Also, since it is plain Python, run it with PyPy for faster runs.

You can do this better with an FSM, at the cost of a bit more RAM. This is a hint; you will have to build the FSM yourself.

Edit 1: This is the version of the script I wrote after seeing the log file. Please correct me if there is any mistake:

#!/usr/bin/env python

import sys
import re


def parse(filepath):
    d = {}
    regex1 = re.compile(r'(.*)?BIND\sREQ(.*)uid=(\w+)')
    regex2 = re.compile(r'(.*)?SEARCH\sREQ(.*)uid=(\w+)')
    with open(filepath, 'r') as f:
        for l in f:
            m = regex1.search(l)
            if m:
                uid = m.group(3)
                if uid in d:
                    d[uid]['bind_count'] += 1
                else:
                    d[uid] = {'bind_count': 1, 'search_count': 0}
            m = regex2.search(l)
            if m:
                uid = m.group(3)
                if uid in d:
                    d[uid]['search_count'] += 1
                else:
                    d[uid] = {'bind_count': 0, 'search_count': 1}

    for k in d:
        print('user id = ' + k,
              'Bind count = ' + str(d[k]['bind_count']),
              'Search count = ' + str(d[k]['search_count']))


def process_args():
    # Compare len(sys.argv) to 2, not the list itself.
    if len(sys.argv) < 2:
        print('Usage: parse_ldap_log.py log_filepath')
        sys.exit(1)


if __name__ == '__main__':
    process_args()
    parse(sys.argv[1])

Thank the Gods that it was not complicated enough to warrant an FSM.

Use the itertools library instead of that many loops.
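One way to read that suggestion: when several log files have to be processed in one run, `itertools.chain.from_iterable` can merge their line streams into one iterable, and `collections.Counter` does the counting without explicit nested loops. The sample lines and the regex below are hypothetical.

```python
import itertools
import re
from collections import Counter

# Hypothetical per-file line lists; in practice these would be open file handles.
files_lines = [
    [' BIND REQ conn=1 dn="uid=alice,ou=people"'],
    [' BIND REQ conn=2 dn="uid=bob,ou=people"',
     ' BIND REQ conn=3 dn="uid=alice,ou=people"'],
]

uid_re = re.compile(r' BIND REQ .*dn="uid=([^,"]+)')

# Chain all files into one stream, extract uids lazily, count in one pass.
uids = (m.group(1)
        for m in map(uid_re.search, itertools.chain.from_iterable(files_lines))
        if m)
binds = Counter(uids)
print(binds)  # Counter({'alice': 2, 'bob': 1})
```

Because everything is a generator, no intermediate list of matches is materialized, which matters for multi-hundred-megabyte inputs.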

Your script has quadratic complexity: for each entry you extract, you scan the file content again to match it. My suggestion is to read the file only once, counting the occurrences of the needed entries (the ones matching " BIND REQ ") as you go.
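A minimal illustration of the single-pass idea, with hypothetical sample lines standing in for the real log:

```python
def count_matching(lines, needle=' BIND REQ '):
    """Count lines containing `needle` in a single pass: O(total input size)."""
    return sum(1 for line in lines if needle in line)

sample = [
    ' BIND REQ conn=1 dn="uid=alice,ou=people"',
    ' SEARCH REQ conn=1 base="ou=people"',
    ' BIND REQ conn=2 dn="uid=bob,ou=people"',
]
print(count_matching(sample))  # 2
```

The same shape works when iterating over an open file handle, since Python reads it line by line without loading the whole file into memory.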

I was able to solve my problem with the code below.

import re
import datetime
from collections import defaultdict

start_time = datetime.datetime.now()

bind_count = defaultdict(int)   # uid -> number of BINDs
search_conn = defaultdict(int)  # conn id -> number of SEARCHes
bind_conn = defaultdict(str)    # conn id -> uid that bound on it
j = defaultdict(int)            # uid -> total searches

total_searches = 0
total_binds = 0

# Single pass over the file: classify each line as BIND or SEARCH.
with open("C:\\access", "r") as fh:
    for line in fh:
        reg1 = re.search(r' BIND REQ .*conn=(\d+).*dn=(.*")', line)
        reg2 = re.search(r' SEARCH REQ .*conn=(\d+).*', line)
        if reg1:
            total_binds += 1
            uid, con = reg1.group(2, 1)
            bind_count[uid] += 1
            bind_conn[con] = uid
        if reg2:
            total_searches += 1
            skey = reg2.group(1)
            search_conn[skey] += 1

# Attribute per-connection search counts to the user who bound on that connection.
for conid in search_conn:
    if conid in bind_conn:
        new_key = bind_conn[conid]
        j[new_key] += search_conn[conid]

for k, v in bind_count.items():
    print(k, " = ", v)

print("*" * 80)

for k, v in j.items():
    print(k, "-->", v)

end_time = datetime.datetime.now()
print("Total time taken - {}".format(end_time - start_time))
