How do I speed up parsing large log files in Python?
I need to run a parser on .gz files that contain log files for a project I am working on. Once extracted, each log file is roughly 800MB, and each .zip file can contain up to 20 of them.
In total, I would need to parse through as much as 20GB of raw text files in a single shot. I have no control over the structure of the log / .gz files, as these are downloaded from the company's AWS server.
Within these log files, I need to look for a particular code, and if the code exists within a line, I need to extract the relevant data and save it to a csv file.
Right now, I am searching through line by line and, as expected, a single file takes as long as 10 minutes to complete, timed using timeit:
with gzip.open(file_location, 'rb') as f:
    for line in f:
        line_string = line.decode().strip()
        if self.config_dict["log_type"] in line_string:
            log.append(line_string)
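One cheap win inside a loop like this, before reaching for external tools: the substring test can be done on the raw bytes, so only the (presumably rare) matching lines pay for a decode. A minimal sketch, with `filter_lines` as a hypothetical helper name:

```python
import gzip

def filter_lines(file_location: str, needle: str) -> list[str]:
    """Collect lines containing needle, decoding only the matches."""
    needle_bytes = needle.encode()
    matches = []
    with gzip.open(file_location, 'rb') as f:
        for line in f:
            # Substring test on bytes skips a per-line str decode
            if needle_bytes in line:
                matches.append(line.decode().strip())
    return matches
```

Whether this helps noticeably depends on the match rate; the gzip decompression itself still dominates for mostly-non-matching files.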
Is there any way I can speed up the parsing?
Edit: To give context, this is how a single line of the log file may look:
8=FIX.4.49=28935=834=109049=TESTSELL152=20180920-18:23:53.67156=TESTBUY16=113.3511=63673064027889863414=3500.000000000015=USD17=2063673064633531000021=231=113.3532=350037=2063673064633531000038=700039=140=154=155=MSFT60=20180920-18:23:53.531150=F151=3500453=1448=BRK2447=D452=110=151
Within this, I am checking for a very specific substring, let's say "155=MSFT", and if there is a match, I will add the line to a certain list.
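As an aside on extracting the value once a line matches: FIX messages delimit fields with the SOH control character (`\x01`), which often renders invisibly when logs are displayed, making the fields look concatenated. Assuming the delimiter survives in the actual files, a sketch of a tag extractor (`get_tag` is a hypothetical helper, not from the question):

```python
from typing import Optional

SOH = '\x01'  # FIX field delimiter; usually invisible in log viewers

def get_tag(fix_message: str, tag: int) -> Optional[str]:
    """Return the value of a FIX tag, e.g. tag 55 (Symbol), or None."""
    prefix = f"{tag}="
    for field in fix_message.split(SOH):
        if field.startswith(prefix):
            return field[len(prefix):]
    return None
```

Anchoring on the full `tag=` prefix after splitting avoids false hits like matching "55=" inside "155=".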
I would outsource the work to something faster than Python. zgrep(1) exists exactly for this task:
import subprocess

search_process = subprocess.Popen(
    ["zgrep", "-F", "--", self.config_dict["log_type"], file_location],
    stdin=subprocess.DEVNULL,
    stdout=subprocess.PIPE,
    stderr=subprocess.DEVNULL,
    encoding="utf-8",
)
log.extend(search_process.stdout)
# zgrep, like grep, exits with status 1 when no lines matched at all,
# so only treat statuses above 1 as real failures
if search_process.wait() not in (0, 1):
    raise Exception(f"search process failed with code {search_process.returncode}")
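If the matches should not all be held in memory at once, the same subprocess can be consumed lazily as a generator. A sketch under the same assumptions (`zgrep_lines` is a hypothetical name; zgrep must be on PATH):

```python
import subprocess
from typing import Iterator

def zgrep_lines(pattern: str, file_location: str) -> Iterator[str]:
    """Stream matching lines from a .gz file via zgrep, one at a time."""
    proc = subprocess.Popen(
        ["zgrep", "-F", "--", pattern, file_location],
        stdin=subprocess.DEVNULL,
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        encoding="utf-8",
    )
    assert proc.stdout is not None
    # Lines are yielded as zgrep produces them, keeping memory flat
    yield from proc.stdout
    # Exit status 1 just means "no matches"; 2+ is a genuine error
    if proc.wait() not in (0, 1):
        raise RuntimeError(f"zgrep failed with code {proc.returncode}")
```

The caller can then write each line straight to the output file instead of accumulating a 20GB list.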
One example of processing multiple files with a multiprocessing Pool:
import gzip
from multiprocessing import Pool

def process_log(filepath: str, log_type: str) -> list[str]:
    results = []
    # Text mode ('rt') yields str lines, matching the str log_type
    with gzip.open(filepath, 'rt') as f:
        for line in f:
            if log_type in line:
                results.append(line.strip())
    return results
def process_log_files(log_filepaths: list[str], log_type: str):
    args = [(filepath, log_type) for filepath in log_filepaths]
    with Pool() as pool:
        with open('output.txt', 'w', encoding='utf8') as out:
            for results in pool.starmap(process_log, args):
                for result in results:
                    out.write(result + '\n')
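One variation worth knowing: `starmap` returns results in submission order, so writing cannot start until the first file finishes. With `imap_unordered`, each file's results are handed back as soon as its worker completes, so output begins earlier (at the cost of file order in the output). A sketch of the same pipeline with hypothetical names `process_one` / `process_log_files_unordered`:

```python
import gzip
from multiprocessing import Pool

def process_one(args: tuple) -> list:
    """Filter one .gz file; takes a (filepath, log_type) tuple."""
    filepath, log_type = args
    # Text mode so `log_type in line` compares str to str
    with gzip.open(filepath, 'rt') as f:
        return [line.strip() for line in f if log_type in line]

def process_log_files_unordered(log_filepaths: list, log_type: str) -> None:
    args = [(p, log_type) for p in log_filepaths]
    with Pool() as pool, open('output.txt', 'w', encoding='utf8') as out:
        # Results arrive in completion order, not submission order
        for results in pool.imap_unordered(process_one, args):
            for result in results:
                out.write(result + '\n')
```

With roughly equal-sized files the difference is small, but it smooths out the case where one file is much slower than the rest.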