Parsing huge files in Python

Question

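How can I parse a large file with regular expressions (using the re module) without loading the whole file into a string (or into memory)? Memory-mapped files do not seem to help, because their contents cannot be converted into some kind of lazy string. The re module only accepts a string as the content argument.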

#include <boost/format.hpp>
#include <boost/iostreams/device/mapped_file.hpp>
#include <boost/regex.hpp>
#include <iostream>

int main(int argc, char* argv[])
{
    boost::iostreams::mapped_file fl("BigFile.log");
    //boost::regex expr("\\w+>Time Elapsed .*?$", boost::regex::perl);
    boost::regex expr("something useful");
    boost::match_flag_type flags = boost::match_default;
    boost::iostreams::mapped_file::iterator start, end;
    start = fl.begin();
    end = fl.end();
    boost::match_results<boost::iostreams::mapped_file::iterator> what;
    while(boost::regex_search(start, end, what, expr))
    {
        std::cout<<what[0].str()<<std::endl;
        start = what[0].second;
    }
    return 0;
}

The short C++ (Boost) example above demonstrates what I am asking for. I would like to do the same thing in Python.

Answer 1

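Everything works now (the Python 3.2.3 interface differs somewhat from Python 2.7). In Python 3.2.3 the search pattern must be prefixed with b to get a working solution, because an mmap object is searched as bytes, not as text.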

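import mmap
import pprint
import re

def ParseFile(fileName):
    # Binary mode: the mapping exposes bytes, not text.
    with open(fileName, "rb") as f:
        print("File opened successfully")
        # Length 0 maps the whole file; the OS pages it in on demand.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            print("File mapped successfully")
            # The pattern must be bytes because mmap is a bytes-like object.
            for item in re.finditer(rb"\w+>Time Elapsed .*?\n", m):
                pprint.pprint(item.group(0))

if __name__ == "__main__":
    ParseFile("testre")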
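To check the point about the b prefix in isolation, here is a minimal sketch that builds a small throwaway file, searches it through mmap, and records what happens with a str pattern. The helper name find_in_file and the sample log lines are made up for illustration; they are not part of the original answer.

```python
import mmap
import os
import re
import tempfile


def find_in_file(path, pattern):
    """Return every match of `pattern` in the file at `path`, searched via mmap."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            # group(0) copies only the matched bytes out of the mapping.
            return [match.group(0) for match in re.finditer(pattern, m)]


# Hypothetical sample data, written to a temporary file so the sketch runs end to end.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"job1>Time Elapsed 00:01\nnoise\njob2>Time Elapsed 00:02\n")
    sample_path = tmp.name

# A bytes pattern works on the mapping.
found = find_in_file(sample_path, rb"\w+>Time Elapsed .*?\n")

# A str pattern is rejected with TypeError on a bytes-like object.
try:
    find_in_file(sample_path, r"\w+>Time Elapsed .*?\n")
    str_pattern_rejected = False
except TypeError:
    str_pattern_rejected = True

os.remove(sample_path)
```

Only the matched slices are copied into Python objects here; the rest of the file stays in the mapping and is paged in by the operating system as the regex engine scans it.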

Answer 2

It depends on what kind of parsing you are doing.

If the parsing is line-wise, you can iterate over the lines of the file like this:

with open("/some/path") as f:
    for line in f:
        parse(line)

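Otherwise, you will need something like chunking: read one block at a time and parse each block. Obviously this takes more care, in case the text you are trying to match overlaps a chunk boundary.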
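One way the chunking idea could look is sketched below. It is only an illustration under stated assumptions: the name iter_matches and the parameters max_len and chunk_size are invented here, no match is assumed to be longer than max_len bytes (at least 1), and the pattern is assumed to use no anchors or lookaround. The sketch keeps an unprocessed tail of the buffer between reads so that a match straddling a chunk boundary is still found exactly once.

```python
import io
import re


def iter_matches(stream, pattern, max_len, chunk_size=64 * 1024):
    """Yield (file_offset, matched_bytes) for each match in a binary stream.

    Assumptions: no match exceeds `max_len` bytes (max_len >= 1) and the
    pattern has no anchors or lookaround.
    """
    regex = re.compile(pattern)
    buf = b""
    base = 0  # file offset of buf[0]
    while True:
        chunk = stream.read(chunk_size)
        at_eof = not chunk
        buf += chunk
        # A match that starts at or before `limit` has all the bytes it could
        # ever need in the buffer, so later data cannot change it.
        limit = len(buf) if at_eof else len(buf) - max_len
        resume = 0
        for m in regex.finditer(buf):
            if m.start() > limit:
                break  # too close to the end; decide after the next read
            yield base + m.start(), m.group(0)
            resume = m.end()
        if at_eof:
            return
        # Discard everything already decided and keep the tail for the next pass.
        resume = max(resume, limit + 1, 0)
        buf = buf[resume:]
        base += resume


# Usage with an in-memory stream and a deliberately tiny chunk size,
# so that both matches cross chunk boundaries.
data = b"x>Time Elapsed 1\nnoise\nyy>Time Elapsed 22\n"
pat = rb"\w+>Time Elapsed .*?\n"
chunked = list(iter_matches(io.BytesIO(data), pat, max_len=32, chunk_size=5))
```

Memory use is bounded by roughly chunk_size plus max_len plus the length of any match still pending, at the cost of having to know an upper bound on the match length in advance.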

Answer 3

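To elaborate on Julian's solution, you can implement chunking (if you want to use a multi-line regex) by storing and concatenating consecutive lines, like this: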

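from collections import deque

N = 3  # number of lines per window (illustrative value)
list_prev_lines = deque(maxlen=N)  # the oldest line is dropped automatically
with open("/some/path") as f:
    for line in f:
        list_prev_lines.append(line)
        if len(list_prev_lines) == N:
            # Lines keep their trailing newline, so join with an empty separator.
            parse("".join(list_prev_lines))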

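This keeps a running window of the previous N lines, including the current line, and then parses the multi-line group as a single string.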
