What is the most efficient way to find particular strings in huge text files (>1GB) in Python?
I am developing a string filter for huge process log files in a distributed system.
These log files are larger than 1GB and contain millions of lines. They contain a special type of message block that starts with "SMsg{" and ends with "}". My program reads the whole file line by line and puts the line numbers of the lines that contain "SMsg{" into a list. Here is my Python method to do that.
def FindNMsgStart(self,logfile):
    self.logfile = logfile
    lf = LogFilter()
    infile = lf.OpenFile(logfile, 'Input')
    NMsgBlockStart = list()
    for num, line in enumerate(infile.readlines()):
        if re.search('SMsg{', line):
            NMsgBlockStart.append(num)
    return NMsgBlockStart
This is my lookup function, which searches for any kind of word in the text file.
def Lookup(self,infile,regex,start,end):
    self.infile = infile
    self.regex = regex
    self.start = start
    self.end = end
    result = 0
    for num, line in enumerate(itertools.islice(infile,start,end)):
        if re.search(regex, line):
            result = num + start
            break
    return result
Then I take that list and find the end of each starting block by searching through the whole file. The following is my code for finding the end.
def FindNmlMsgEnd(self,logfile,NMsgBlockStart):
    self.logfile = logfile
    self.NMsgBlockStart = NMsgBlockStart
    NMsgBlockEnd = list()
    lf = LogFilter()
    length = len(NMsgBlockStart)
    if length > 0:
        for i in range (0,length):
            start=NMsgBlockStart[i]
            infile = lf.OpenFile(logfile, 'Input')
            lines = lf.LineCount(logfile, 'Input')
            end = lf.Lookup(infile, '}', start, lines+1)
            NMsgBlockEnd.append(end)
        return NMsgBlockEnd
    else:
        print("There is no Normal Message blocks.")
But these methods are not efficient enough to handle huge files. The program runs for a long time without producing a result.
I am writing other filters too, but first I need to find a solution to this basic problem. I am really new to Python. Please help me.
I see a couple of issues that are slowing your code down.
The first seems to be a pretty basic error. You're calling readlines on your file in the FindNMsgStart method, which reads the whole file into memory and returns a list of its lines.
You should just iterate over the lines directly by using enumerate(infile). You do this properly in the other functions that read the file, so I suspect this is a typo or just a simple oversight.
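As a minimal sketch of that fix, here is a standalone version of the start-line scan. The function name find_msg_starts, its marker parameter, and the plain open() call (in place of your LogFilter.OpenFile helper, whose behavior I can't see) are assumptions for illustration. It also uses a substring test instead of re.search, since the marker is a fixed string.

```python
def find_msg_starts(path, marker="SMsg{"):
    """Return the 0-based numbers of the lines that contain `marker`."""
    starts = []
    # Iterating over the file object yields one line at a time, so the
    # whole file is never held in memory (unlike infile.readlines()).
    with open(path, encoding="utf-8", errors="replace") as infile:
        for num, line in enumerate(infile):
            if marker in line:  # plain substring test, no regex needed
                starts.append(num)
    return starts
```

The list of start numbers still grows with the number of message blocks, but that is far smaller than a list holding every line of a multi-gigabyte file.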
The second issue is a bit more complicated. It involves the general architecture of your search.
You're first scanning the file for message start lines, then searching for the end line after each start. Each end-line search requires re-reading much of the file, since you need to skip all the lines that occur before the start line. It would be a lot more efficient if you could combine both searches into a single pass over the data file.
Here's a really crude generator function that does that:
def find_message_bounds(filename):
    with open(filename) as f:
        iterator = enumerate(f)
        for start_line_no, start_line in iterator:
            if 'SMsg{' in start_line:
                for end_line_no, end_line in iterator:
                    if '}' in end_line:
                        yield start_line_no, end_line_no
                        break
This function yields (start, end) line number tuples, and only makes a single pass over the file.
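To show how the generator might be used, here is a small self-contained sketch. The sample log contents and the temporary file are made up for the demonstration, and the function is repeated only so the snippet runs on its own.

```python
import os
import tempfile

def find_message_bounds(filename):
    # Same single-pass generator as above, repeated so this snippet is runnable.
    with open(filename) as f:
        iterator = enumerate(f)
        for start_line_no, start_line in iterator:
            if 'SMsg{' in start_line:
                # The inner loop continues from the line after the start line,
                # because both loops share one iterator.
                for end_line_no, end_line in iterator:
                    if '}' in end_line:
                        yield start_line_no, end_line_no
                        break

# Hypothetical sample log written to a temporary file.
sample = "boot\nSMsg{\n a=1\n}\nnoise\nSMsg{\n b=2\n c=3\n}\n"
with tempfile.NamedTemporaryFile("w", suffix=".log", delete=False) as tmp:
    tmp.write(sample)

try:
    bounds = list(find_message_bounds(tmp.name))
finally:
    os.remove(tmp.name)

# Split the pairs into the two lists the original code was building.
starts = [s for s, _ in bounds]
ends = [e for _, e in bounds]
print(bounds)  # [(1, 3), (5, 8)]
```

Note that the end marker is only looked for on the lines after the start line, so a block that opens and closes on the same line would be paired with a later "}". Whether that matters depends on how your logs are laid out.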
I think you can actually implement a one-pass search using your Lookup method, if you're careful with the boundary variables you pass in to it.
def FindNmlMsgEnd(self,logfile,NMsgBlockStart):
    self.logfile = logfile
    self.NMsgBlockStart = NMsgBlockStart
    NMsgBlockEnd = list()
    lf = LogFilter()
    infile = lf.OpenFile(logfile, 'Input')
    total_lines = lf.LineCount(logfile, 'Input')
    start = NMsgBlockStart[0]
    prev_end = -1
    # Walk the remaining start lines; each Lookup call continues from where
    # the previous one stopped, so the bounds are relative to prev_end.
    for next_start in NMsgBlockStart[1:]:
        end = lf.Lookup(infile, '}', start-prev_end-1, next_start-prev_end-1) + prev_end + 1
        NMsgBlockEnd.append(end)
        start = next_start
        prev_end = end
    # The last block has no following start line, so search to the end of the file.
    last_end = lf.Lookup(infile, '}', start-prev_end-1, total_lines-prev_end-1) + prev_end + 1
    NMsgBlockEnd.append(last_end)
    return NMsgBlockEnd
It's possible I have an off-by-one error in there somewhere; the design of the Lookup function makes it difficult to call repeatedly.
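One way to sidestep the offset arithmetic, sketched here as an assumption rather than a drop-in replacement for your class, is a lookup helper that works on a shared enumerate() iterator. The line numbers then stay absolute and each call simply continues where the previous one stopped. The names lookup_from and find_bounds are hypothetical, and an in-memory io.StringIO stands in for the log file.

```python
import io
import re

def lookup_from(numbered_lines, pattern):
    """Advance a shared enumerate() iterator and return the absolute number
    of the first line matching `pattern`, or None if the input runs out."""
    for num, line in numbered_lines:
        if re.search(pattern, line):
            return num
    return None

def find_bounds(infile):
    numbered = enumerate(infile)  # one iterator shared by every lookup
    bounds = []
    while True:
        start = lookup_from(numbered, r'SMsg\{')
        if start is None:
            break
        end = lookup_from(numbered, r'\}')
        if end is None:
            break  # unterminated block at the end of the input
        bounds.append((start, end))
    return bounds

# Made-up sample input for the demonstration.
log = io.StringIO("x\nSMsg{\n a\n}\ny\nSMsg{\n b\n}\n")
result = find_bounds(log)
print(result)  # [(1, 3), (5, 7)]
```

Returning None when nothing matches also avoids the ambiguity of the original Lookup, where a result of 0 could mean either "found on line 0" or "not found".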