How do I speed up parsing large log files in Python?
I need to run a parser on .gz files that contain log files for a project I am working on. Once extracted, each log file is roughly 800MB, and each .zip file can contain up to 20 of them.
In total, I would need to parse through as much as 20GB of raw text files in a single shot. I have no control over the structure of the log / .gz files, as these are downloaded from the company's AWS server.
Within these log files, I need to look for a particular code, and if the code exists within a line, I need to extract the relevant data and save it to a csv file.
Right now, I am searching through line by line and, as expected, a single file takes as long as 10 minutes to complete, timed using timeit:
with gzip.open(file_location, 'rb') as f:
    for line in f:
        line_string = line.decode().strip()
        if self.config_dict["log_type"] in line_string:
            log.append(line_string)
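One cheap win inside a loop like this, before reaching for external tools: the substring test can be done on the raw bytes, so only the (presumably rare) matching lines pay for a decode. A minimal sketch, with `filter_lines` as a hypothetical helper name:

```python
import gzip

def filter_lines(file_location: str, needle: str) -> list[str]:
    """Collect lines containing needle, decoding only the matches."""
    needle_bytes = needle.encode()
    matches = []
    with gzip.open(file_location, 'rb') as f:
        for line in f:
            # Substring test on bytes skips a per-line str decode
            if needle_bytes in line:
                matches.append(line.decode().strip())
    return matches
```

Whether this helps noticeably depends on the match rate; the gzip decompression itself still dominates for mostly-non-matching files.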
Is there any way I can speed up the parsing?
Edit: To give context, this is how a single line of the log file may look:
8=FIX.4.49=28935=834=109049=TESTSELL152=20180920-18:23:53.67156=TESTBUY16=113.3511=63673064027889863414=3500.000000000015=USD17=2063673064633531000021=231=113.3532=350037=2063673064633531000038=700039=140=154=155=MSFT60=20180920-18:23:53.531150=F151=3500453=1448=BRK2447=D452=110=151
Within this, I am checking for a very specific substring, let's say "155=MSFT", and if there is a match, I will add the line to a certain list.
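As an aside on extracting the value once a line matches: FIX messages delimit fields with the SOH control character (`\x01`), which often renders invisibly when logs are displayed, making the fields look concatenated. Assuming the delimiter survives in the actual files, a sketch of a tag extractor (`get_tag` is a hypothetical helper, not from the question):

```python
from typing import Optional

SOH = '\x01'  # FIX field delimiter; usually invisible in log viewers

def get_tag(fix_message: str, tag: int) -> Optional[str]:
    """Return the value of a FIX tag, e.g. tag 55 (Symbol), or None."""
    prefix = f"{tag}="
    for field in fix_message.split(SOH):
        if field.startswith(prefix):
            return field[len(prefix):]
    return None
```

Anchoring on the full `tag=` prefix after splitting avoids false hits like matching "55=" inside "155=".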
I would outsource the work to something faster than Python. zgrep(1) exists exactly for this task:
import subprocess

search_process = subprocess.Popen(
    ["zgrep", "-F", "--", self.config_dict["log_type"], file_location],
    stdin=subprocess.DEVNULL,
    stdout=subprocess.PIPE,
    stderr=subprocess.DEVNULL,
    encoding="utf-8",
)
log.extend(search_process.stdout)
# zgrep, like grep, exits with status 1 when no lines matched at all,
# so only treat statuses above 1 as real failures
if search_process.wait() not in (0, 1):
    raise Exception(f"search process failed with code {search_process.returncode}")
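If the matches should not all be held in memory at once, the same subprocess can be consumed lazily as a generator. A sketch under the same assumptions (`zgrep_lines` is a hypothetical name; zgrep must be on PATH):

```python
import subprocess
from typing import Iterator

def zgrep_lines(pattern: str, file_location: str) -> Iterator[str]:
    """Stream matching lines from a .gz file via zgrep, one at a time."""
    proc = subprocess.Popen(
        ["zgrep", "-F", "--", pattern, file_location],
        stdin=subprocess.DEVNULL,
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        encoding="utf-8",
    )
    assert proc.stdout is not None
    # Lines are yielded as zgrep produces them, keeping memory flat
    yield from proc.stdout
    # Exit status 1 just means "no matches"; 2+ is a genuine error
    if proc.wait() not in (0, 1):
        raise RuntimeError(f"zgrep failed with code {proc.returncode}")
```

The caller can then write each line straight to the output file instead of accumulating a 20GB list.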
One example of processing multiple files with a multiprocessing Pool:
import gzip
from multiprocessing import Pool

def process_log(filepath: str, log_type: str) -> list[str]:
    results = []
    # Text mode ('rt') yields str lines, matching the str log_type
    with gzip.open(filepath, 'rt') as f:
        for line in f:
            if log_type in line:
                results.append(line.strip())
    return results
def process_log_files(log_filepaths: list[str], log_type: str):
    args = [(filepath, log_type) for filepath in log_filepaths]
    with Pool() as pool:
        with open('output.txt', 'w', encoding='utf8') as out:
            for results in pool.starmap(process_log, args):
                for result in results:
                    out.write(result + '\n')
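One variation worth knowing: `starmap` returns results in submission order, so writing cannot start until the first file finishes. With `imap_unordered`, each file's results are handed back as soon as its worker completes, so output begins earlier (at the cost of file order in the output). A sketch of the same pipeline with hypothetical names `process_one` / `process_log_files_unordered`:

```python
import gzip
from multiprocessing import Pool

def process_one(args: tuple) -> list:
    """Filter one .gz file; takes a (filepath, log_type) tuple."""
    filepath, log_type = args
    # Text mode so `log_type in line` compares str to str
    with gzip.open(filepath, 'rt') as f:
        return [line.strip() for line in f if log_type in line]

def process_log_files_unordered(log_filepaths: list, log_type: str) -> None:
    args = [(p, log_type) for p in log_filepaths]
    with Pool() as pool, open('output.txt', 'w', encoding='utf8') as out:
        # Results arrive in completion order, not submission order
        for results in pool.imap_unordered(process_one, args):
            for result in results:
                out.write(result + '\n')
```

With roughly equal-sized files the difference is small, but it smooths out the case where one file is much slower than the rest.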