
Python big file parsing with re

How can I parse a large file with regular expressions (using the re module) without loading the whole file into a string (or into memory)? Memory-mapped files don't seem to help, because their content can't be converted into some kind of lazy string, and the re module appears to accept only a string as its content argument.

#include <boost/format.hpp>
#include <boost/iostreams/device/mapped_file.hpp>
#include <boost/regex.hpp>
#include <iostream>

int main(int argc, char* argv[])
{
    boost::iostreams::mapped_file fl("BigFile.log");
    //boost::regex expr("\\w+>Time Elapsed .*?$", boost::regex::perl);
    boost::regex expr("something useful");
    boost::match_flag_type flags = boost::match_default;
    boost::iostreams::mapped_file::iterator start, end;
    start = fl.begin();
    end = fl.end();
    boost::match_results<boost::iostreams::mapped_file::iterator> what;
    while(boost::regex_search(start, end, what, expr))
    {
        std::cout<<what[0].str()<<std::endl;
        start = what[0].second;
    }
    return 0;
}

To demonstrate my requirements, I wrote a short sample in C++ (with Boost); I want the same behavior in Python.

Everything now works OK (Python 3.2.3 has some interface differences from Python 2.7). The search pattern just needs a b"" prefix (a bytes pattern) to get a working solution in Python 3.2.3.

import re
import mmap
import pprint

def ParseFile(fileName):
    f = open(fileName, "rb")  # binary mode: the mmap exposes raw bytes
    print("File opened successfully")
    m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    print("File mapped successfully")
    # The pattern must be a bytes pattern, since the mmap yields bytes.
    items = re.finditer(rb"\w+>Time Elapsed .*?\n", m)
    for item in items:
        pprint.pprint(item.group(0))
    m.close()
    f.close()

if __name__ == "__main__":
    ParseFile("testre")

It depends on what sort of parsing you're doing.

If the parsing you're doing is linewise, you can iterate over the lines of a file with:

with open("/some/path") as f:
    for line in f:
        parse(line)

Otherwise, you'll need something like chunking: read a chunk at a time and parse it. Obviously this requires much more care, in case what you're trying to match overlaps a chunk boundary.
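To make the boundary problem concrete, here is one minimal sketch of chunked scanning (the function name `iter_matches` and the `max_match` parameter are my own, not from the answer). It assumes no single match is longer than `max_match` bytes; a match that touches the still-growing tail of the buffer is deferred and re-scanned once more data has been read.

```python
import re

def iter_matches(path, pattern, chunk_size=1 << 20, max_match=1 << 16):
    """Yield matches of a bytes regex over a large file, reading it in
    chunks. Assumes no single match exceeds max_match bytes."""
    regex = re.compile(pattern)
    buf = b""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            last = not chunk          # empty read: end of file
            buf += chunk
            start = 0                 # offset up to which buf is consumed
            for m in regex.finditer(buf):
                if not last and m.end() > len(buf) - max_match:
                    # Match touches the unstable tail: defer it and
                    # re-scan after the next read.
                    start = m.start()
                    break
                yield m.group(0)
                start = m.end()
            else:
                if not last:
                    # No deferred match: drop everything except the tail
                    # that might still contain the start of a match.
                    start = max(start, len(buf) - max_match)
            buf = buf[start:]
            if last:
                return
```

This trades memory (at most roughly a chunk plus `max_match` bytes) for the care the answer mentions: matches ending inside the last `max_match` bytes are never reported until it is certain more data cannot change them.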

To elaborate on Julian's solution, you can achieve chunking (if you want to run multiline regexes) by storing and concatenating consecutive lines, like so:

list_prev_lines = []
with open("/some/path") as f:
    for i in range(N):
        list_prev_lines.append(f.readline())
    for line in f:
        list_prev_lines.pop(0)
        list_prev_lines.append(line)
        parse("".join(list_prev_lines))

This keeps a running list of the previous N lines, the current line included, and parses the multi-line group as a single string.
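The same sliding window can be written with `collections.deque` (a variant of mine, not from the answer): a `maxlen` deque drops the oldest line automatically, avoiding the O(N) `pop(0)`. The sketch below also reports only matches ending in the newest line, so a match is not reported again on later iterations; the name `parse_multiline` is hypothetical.

```python
import re
from collections import deque

def parse_multiline(path, pattern, window=3):
    """Scan a file with a sliding window of `window` lines, so a regex
    can match across line boundaries without loading the whole file."""
    regex = re.compile(pattern)
    lines = deque(maxlen=window)  # maxlen drops the oldest line on append
    with open(path) as f:
        for line in f:
            lines.append(line)
            joined = "".join(lines)
            for m in regex.finditer(joined):
                # Only yield matches that end in the newest line, so the
                # same match is not yielded again as the window slides.
                if m.end() > len(joined) - len(line):
                    yield m.group(0)
```

As with the list version, this assumes no match spans more than `window` lines.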
