I have a file, dataset.nt, which isn't too large (300 MB). I also have a list containing around 500 elements. For each element of the list, I want to count the number of lines in the file that contain it, and add that key/value pair to a dictionary (the key being the name of the list element, and the value the number of times this element appears in the file).
This is the first thing I tried to achieve that result:
import re

mydict = {}
for i in mylist:
    regex = re.compile(r"/Main/" + re.escape(i))
    total = 0
    with open("dataset.nt", "rb") as input:
        for line in input:
            if regex.search(line):
                total = total + 1
    mydict[i] = total
It didn't work (as in, it runs indefinitely), and I figured I should find a way not to read each line 500 times. So I tried this:
mydict = {}
with open("dataset.nt", "rb") as input:
    for line in input:
        for i in mylist:
            regex = re.compile(r"/Main/" + re.escape(i))
            total = 0
            if regex.search(line):
                total = total + 1
            mydict[i] = total
Performance didn't improve, the script still runs indefinitely. So I googled around, and I tried this:
mydict = {}
file = open("dataset.nt", "rb")
while 1:
    lines = file.readlines(100000)
    if not lines:
        break
    for line in lines:
        for i in mylist:  # was "for i in list:", which iterates the builtin type
            regex = re.compile(r"/Main/" + re.escape(i))
            total = 0
            if regex.search(line):
                total = total + 1
            mydict[i] = total
That one has been running for the last 30 minutes, so I'm assuming it's not any better.
How should I structure this code so that it completes in a reasonable amount of time?
I'd favor a slight modification of your second version:
mydict = dict.fromkeys(mylist, 0)

re_list = [re.compile(r"/Main/" + re.escape(i)) for i in mylist]

with open("dataset.nt", "rb") as input:
    for line in input:
        # any match has to contain the "/Main/" part
        # -> check it's there
        # that may help a lot or not at all
        # depending on what's in your file
        if '/Main/' not in line:
            continue

        # do the regex-part; increment rather than overwrite,
        # so counts accumulate across lines
        for i, regex in zip(mylist, re_list):
            if regex.search(line):
                mydict[i] += 1
As @matsjoyce already suggested, this avoids re-compiling the regex on each iteration. If you really need to match that many different regex patterns, then I don't think there's much more you can do.
Maybe it's worth checking whether you can regex-capture whatever comes after "/Main/" and then compare it against your list. That may help reduce the number of "real" regex searches.
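For example, here is a minimal sketch of that capture idea (the names capture, wanted and counts are mine, and the \w+ is only an assumption about what your identifiers look like — adjust it to whatever can actually follow "/Main/"):

import re

# One regex that captures the token after "/Main/"; a single search
# per line replaces the 500 per-element searches.
capture = re.compile(r"/Main/(\w+)")

counts = dict.fromkeys(mylist, 0)
wanted = set(mylist)  # set membership is O(1), unlike scanning the list

with open("dataset.nt") as input:
    for line in input:
        match = capture.search(line)
        if match and match.group(1) in wanted:
            counts[match.group(1)] += 1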
Looks like a good candidate for some map/reduce-like parallelisation... You could split your dataset file into N chunks (where N is the number of processors you have), launch N subprocesses each scanning one chunk, then sum the results.
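Here is a rough multiprocessing sketch of that split-and-sum idea (the chunking scheme, scan_chunk and NPROC are my own untested illustration; lines and patterns are bytes because the file is opened in binary mode):

import os
import re
from collections import Counter
from multiprocessing import Pool

FILENAME = "dataset.nt"
NPROC = 4  # tune to the number of processors you have

mylist = [...]  # your 500 items
targets = [(i, re.compile(b"/Main/" + re.escape(i.encode())))
           for i in mylist]

def scan_chunk(bounds):
    # Count matches in the byte range [start, end) of the file.
    start, end = bounds
    counts = Counter()
    with open(FILENAME, "rb") as f:
        if start:
            # If we landed mid-line, skip the rest of that line:
            # it belongs to the previous chunk's worker.
            f.seek(start - 1)
            if f.read(1) != b"\n":
                f.readline()
        while f.tell() < end:
            line = f.readline()
            if not line:
                break
            if b"/Main/" not in line:
                continue
            for i, regex in targets:
                if regex.search(line):
                    counts[i] += 1
    return counts

if __name__ == "__main__":
    size = os.path.getsize(FILENAME)
    bounds = [(size * k // NPROC, size * (k + 1) // NPROC)
              for k in range(NPROC)]
    results = Counter()
    with Pool(NPROC) as pool:
        for partial in pool.map(scan_chunk, bounds):
            results.update(partial)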
This of course doesn't prevent you from first optimizing the scan itself, i.e. (based on sebastian's code):
targets = [(i, re.compile(r"/Main/" + re.escape(i))) for i in mylist]
results = dict.fromkeys(mylist, 0)

with open("dataset.nt", "rb") as input:
    for line in input:
        # any match has to contain the "/Main/" part
        # -> check it's there
        # that may help a lot or not at all
        # depending on what's in your file
        if '/Main/' not in line:
            continue

        # do the regex-part
        for i, regex in targets:
            if regex.search(line):
                results[i] += 1
Note that this could be optimized further if you posted a sample from your dataset. If, for example, your dataset can be sorted on "/Main/{i}" (using the system sort program, for example), you wouldn't have to check each line for each value of i. Or, if the position of "/Main/" in the line is known and fixed, you could use a simple string comparison on the relevant part of the string, which can be faster than a regexp; a sketch of that follows.
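A hypothetical sketch of that fixed-position comparison (the offset 20 is made up — measure where "/Main/" actually starts in your lines; the needles are bytes because the file is opened in binary mode):

OFFSET = 20  # made-up column where "/Main/" starts in every line

# Precompute the raw byte strings to compare against.
needles = [(i, b"/Main/" + i.encode()) for i in mylist]

results = dict.fromkeys(mylist, 0)
with open("dataset.nt", "rb") as input:
    for line in input:
        for i, needle in needles:
            if line[OFFSET:OFFSET + len(needle)] == needle:
                results[i] += 1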
The other solutions are very good. But since there is a regex for each element, and it doesn't matter whether the element appears more than once per line, you can count the lines containing the target expression using re.findall.
Also, past a certain number of lines it is better to read the whole file into memory (provided you have enough memory and it isn't a design restriction).
import re

mydict = {}
mylist = [...]  # A list with 500 items

# Optimizing calls: bind the functions to local names so Python
# doesn't have to resolve the attribute lookup on every call.
findall = re.findall
escape = re.escape

with open("dataset.nt", "rb") as input:
    # Read the file once and keep it in memory instead of accessing
    # it line by line. If the number of lines is big, this is faster.
    text = input.read()

for elem in mylist:
    # Count the lines on which the target regex matches.
    mydict[elem] = len(findall(".*/Main/{0}.*\n+".format(escape(elem)), text))
I tested this with an 800 MB file (I wanted to see how long it takes to load a file that big into memory; it's faster than you'd think). I didn't test the whole code with real data, just the findall part.