
In Python, is there a way for re.finditer to take a file as input instead of a string?

Let's say I have a really large file foo.txt and I want to iterate through it, doing something whenever a regular expression matches. Currently I do this:

f = open('foo.txt')
s = f.read()
f.close()
for m in re.finditer(regex, s):
    doSomething()

Is there a way to do this without having to store the entire file in memory?

NOTE: Reading the file line by line is not an option because the regex can possibly span multiple lines.

UPDATE: I would also like this to work with stdin if possible.

UPDATE: I am considering somehow emulating a string object with a custom file wrapper but I am not sure if the regex functions would accept a custom string-like object.

Either you will have to read the file chunk-wise, with enough overlap between chunks to allow for the maximum possible length of a match, or use an mmapped file, which works almost as well as having the whole string in memory: https://docs.python.org/library/mmap.html
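On the mmap route: re can search an mmap object directly, because mmap exposes the buffer protocol (re accepts bytes-like buffers, though not arbitrary custom string-like objects). The pattern must then be a bytes pattern. A minimal sketch, with a small sample file created inline so the snippet stands alone:

```python
import mmap
import re
import tempfile

# Create a small sample file to stand in for foo.txt.
with tempfile.NamedTemporaryFile("wb", delete=False) as f:
    f.write(b"junk\nstart\nmiddle\nend\njunk\n")
    path = f.name

# The pattern must be bytes when searching an mmap;
# re.DOTALL lets '.' match across line breaks.
pattern = re.compile(rb"start.*?end", re.DOTALL)

with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        for m in pattern.finditer(mm):
            print(m.group())  # b'start\nmiddle\nend'
```

The OS pages the file in and out on demand, so only the regions the regex engine is currently scanning need to be resident in memory.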

UPDATE to your UPDATE: consider that stdin isn't a file; it just behaves a lot like one in that it has a file descriptor and so on. It is a POSIX stream. If you are unclear on the difference, do some googling. The OS cannot mmap it, therefore Python cannot either.

Also consider that what you're doing may be an ill-suited job for a regex. Regexes are great for capturing small things, like parsing a connection string, a log entry, CSV data and so on. They are not a good tool for parsing huge chunks of data; this is by design. You may be better off writing a custom parser.

Some words of wisdom from the past: http://regex.info/blog/2006-09-15/247

If you can limit the number of lines that the regex can span to some reasonable number, then you can use a collections.deque to create a rolling window on the file and keep only that number of lines in memory.

from collections import deque
import re

def textwindow(filename, numlines):
    with open(filename) as f:
        # Prime the window with the first numlines lines; maxlen makes it roll.
        window = deque((f.readline() for _ in range(numlines)), maxlen=numlines)
        nextline = True
        while nextline:
            yield "".join(window)
            nextline = f.readline()
            window.append(nextline)

for text in textwindow("bigfile.txt", 10):
    # test to see whether your regex matches and do something
    # (regex and doSomething as in the question)
    for m in re.finditer(regex, text):
        doSomething()

Perhaps you could write a function that yields the file one line at a time (reading a line each pass) and call re.finditer on it until it signals EOF.

Here is another solution, using an internal text buffer to progressively yield matches without loading the entire file into memory.

This buffer acts like a "sliding window" over the file text, moving forward while yielding matches.

Because the file content is loaded in chunks, this solution works with multiline regexes too.

def find_chunked(fileobj, regex, *, chunk_size=4096):
    buffer = ""

    while True:
        text = fileobj.read(chunk_size)
        buffer += text
        matches = list(regex.finditer(buffer))

        # End of file: search through the remaining buffer and exit.
        if not text:
            yield from matches
            break

        # Yield the matches found so far, except the last one, which may be
        # incomplete because of the chunk cut (think about '.*').
        if len(matches) > 1:
            end = matches[-2].end()
            buffer = buffer[end:]
            yield from matches[:-1]

However, note that it may end up loading the whole file into memory if no matches are found at all, so you should only use this function if you are confident that your file contains the pattern many times.
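To see it in action, here is a self-contained check: find_chunked is repeated so the snippet runs on its own, io.StringIO stands in for a real file or sys.stdin, and the pattern and sample text are invented. Note that regex must be a pre-compiled pattern, since the function calls regex.finditer:

```python
import io
import re

def find_chunked(fileobj, regex, *, chunk_size=4096):
    buffer = ""
    while True:
        text = fileobj.read(chunk_size)
        buffer += text
        matches = list(regex.finditer(buffer))
        if not text:  # end of file/stream: flush whatever is left
            yield from matches
            break
        if len(matches) > 1:
            # Withhold the last match; it may still grow with the next chunk.
            end = matches[-2].end()
            buffer = buffer[end:]
            yield from matches[:-1]

# Hypothetical multiline pattern and sample text.
pattern = re.compile(r"<.*?>", re.DOTALL)
stream = io.StringIO("junk <a\nb> junk <c> tail <d\ne>")

print([m.group() for m in find_chunked(stream, pattern, chunk_size=6)])
# ['<a\nb>', '<c>', '<d\ne>']
```

The tiny chunk_size here just forces several buffer refills for the demonstration; in practice a few kilobytes is a reasonable default.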
