I have a file, dataset.nt, which isn't too large (300 MB). I also have a list containing around 500 elements. For each element of the list, I want to count the number of lines in the file that contain it, and add that key/value pair to a dictionary (the key being the name of the list element, and the value the number of times this element appears in the file).
This is the first thing I tried to achieve that result:
import re

mydict = {}
for i in mylist:
    regex = re.compile(r"/Main/" + re.escape(i))
    total = 0
    with open("dataset.nt", "rb") as input:
        for line in input:
            if regex.search(line):
                total = total + 1
    mydict[i] = total
It didn't work (as in, it runs indefinitely), and I figured I should find a way not to read each line 500 times. So I tried this:
mydict = {}
with open("dataset.nt", "rb") as input:
    for line in input:
        for i in mylist:
            regex = re.compile(r"/Main/" + re.escape(i))
            total = 0
            if regex.search(line):
                total = total + 1
            mydict[i] = total
Performance didn't improve, the script still runs indefinitely. So I googled around, and I tried this:
mydict = {}
file = open("dataset.nt", "rb")
while 1:
    lines = file.readlines(100000)
    if not lines:
        break
    for line in lines:
        for i in mylist:  # was "for i in list:", which iterates the builtin type
            regex = re.compile(r"/Main/" + re.escape(i))
            total = 0
            if regex.search(line):
                total = total + 1
            mydict[i] = total
That one has been running for the last 30 minutes, so I'm assuming it's not any better.
How should I structure this code so that it completes in a reasonable amount of time?
I'd favor a slight modification of your second version:
mydict = dict.fromkeys(mylist, 0)

re_list = [re.compile(r"/Main/" + re.escape(i)) for i in mylist]

with open("dataset.nt", "rb") as input:
    for line in input:
        # any match has to contain the "/Main/" part
        # -> check it's there
        # that may help a lot or not at all
        # depending on what's in your file
        if '/Main/' not in line:
            continue

        # do the regex-part; increment rather than overwrite,
        # so counts accumulate across lines
        for i, regex in zip(mylist, re_list):
            if regex.search(line):
                mydict[i] += 1
As @matsjoyce already suggested, this avoids re-compiling the regex on each iteration. If you really need to match that many different regex patterns, then I don't think there's much more you can do.
Maybe it's worth checking whether you can regex-capture whatever comes after "/Main/" and then compare it against your list. That may help reduce the number of "real" regex searches.
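For example, here is a minimal sketch of that capture idea (the names capture, wanted and counts are mine, and the \w+ is only an assumption about what your identifiers look like — adjust it to whatever can actually follow "/Main/"):

import re

# One regex that captures the token after "/Main/"; a single search
# per line replaces the 500 per-element searches.
capture = re.compile(r"/Main/(\w+)")

counts = dict.fromkeys(mylist, 0)
wanted = set(mylist)  # set membership is O(1), unlike scanning the list

with open("dataset.nt") as input:
    for line in input:
        match = capture.search(line)
        if match and match.group(1) in wanted:
            counts[match.group(1)] += 1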
Looks like a good candidate for some map/reduce-like parallelisation... You could split your dataset file into N chunks (where N is the number of processors you have), launch N subprocesses each scanning one chunk, then sum the results.
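Here is a rough multiprocessing sketch of that split-and-sum idea (the chunking scheme, scan_chunk and NPROC are my own untested illustration; lines and patterns are bytes because the file is opened in binary mode):

import os
import re
from collections import Counter
from multiprocessing import Pool

FILENAME = "dataset.nt"
NPROC = 4  # tune to the number of processors you have

mylist = [...]  # your 500 items
targets = [(i, re.compile(b"/Main/" + re.escape(i.encode())))
           for i in mylist]

def scan_chunk(bounds):
    # Count matches in the byte range [start, end) of the file.
    start, end = bounds
    counts = Counter()
    with open(FILENAME, "rb") as f:
        if start:
            # If we landed mid-line, skip the rest of that line:
            # it belongs to the previous chunk's worker.
            f.seek(start - 1)
            if f.read(1) != b"\n":
                f.readline()
        while f.tell() < end:
            line = f.readline()
            if not line:
                break
            if b"/Main/" not in line:
                continue
            for i, regex in targets:
                if regex.search(line):
                    counts[i] += 1
    return counts

if __name__ == "__main__":
    size = os.path.getsize(FILENAME)
    bounds = [(size * k // NPROC, size * (k + 1) // NPROC)
              for k in range(NPROC)]
    results = Counter()
    with Pool(NPROC) as pool:
        for partial in pool.map(scan_chunk, bounds):
            results.update(partial)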
This of course doesn't prevent you from first optimizing the scan itself, i.e. (based on sebastian's code):
targets = [(i, re.compile(r"/Main/" + re.escape(i))) for i in mylist]
results = dict.fromkeys(mylist, 0)

with open("dataset.nt", "rb") as input:
    for line in input:
        # any match has to contain the "/Main/" part
        # -> check it's there
        # that may help a lot or not at all
        # depending on what's in your file
        if '/Main/' not in line:
            continue

        # do the regex-part
        for i, regex in targets:
            if regex.search(line):
                results[i] += 1
Note that this could be optimized further if you posted a sample from your dataset. If, for example, your dataset can be sorted on "/Main/{i}" (using the system sort program, for example), you wouldn't have to check each line for each value of i. Or, if the position of "/Main/" in the line is known and fixed, you could use a simple string comparison on the relevant part of the string, which can be faster than a regexp; a sketch of that follows.
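A hypothetical sketch of that fixed-position comparison (the offset 20 is made up — measure where "/Main/" actually starts in your lines; the needles are bytes because the file is opened in binary mode):

OFFSET = 20  # made-up column where "/Main/" starts in every line

# Precompute the raw byte strings to compare against.
needles = [(i, b"/Main/" + i.encode()) for i in mylist]

results = dict.fromkeys(mylist, 0)
with open("dataset.nt", "rb") as input:
    for line in input:
        for i, needle in needles:
            if line[OFFSET:OFFSET + len(needle)] == needle:
                results[i] += 1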
The other solutions are very good. But since there is a regex for each element, and it doesn't matter whether the element appears more than once per line, you can count the lines containing the target expression using re.findall.
Also, past a certain number of lines it is better to read the whole file into memory (provided you have enough memory and it isn't a design restriction).
import re

mydict = {}
mylist = [...]  # A list with 500 items

# Optimizing calls: bind the functions to local names so Python
# doesn't have to resolve the attribute lookup on every call.
findall = re.findall
escape = re.escape

with open("dataset.nt", "rb") as input:
    # Read the file once and keep it in memory instead of accessing
    # it line by line. If the number of lines is big, this is faster.
    text = input.read()

for elem in mylist:
    # Count the lines on which the target regex matches.
    mydict[elem] = len(findall(".*/Main/{0}.*\n+".format(escape(elem)), text))
I tested this with an 800 MB file (I wanted to see how long it takes to load a file that big into memory; it's faster than you'd think). I didn't test the whole code with real data, just the findall part.