What is the most efficient way to find a particular string in a large text file (> 1GB) in Python?

Question

I am developing a string filter for huge process log files in a distributed system.

These log files are larger than 1GB and contain millions of lines. They contain a special type of message block that starts with "SMsg{" and ends with "}". My program reads the whole file line by line and puts the line numbers of the lines containing "SMsg{" into a list. Here is my Python method to do that.

 def FindNMsgStart(self,logfile):

        self.logfile = logfile

        lf = LogFilter()

        infile = lf.OpenFile(logfile, 'Input')
        NMsgBlockStart = list()


        for num, line in enumerate(infile.readlines()):
            if re.search('SMsg{', line):                
                NMsgBlockStart.append(num)


        return NMsgBlockStart

This is my lookup function, which searches for any kind of word in the text file.

def Lookup(self,infile,regex,start,end):

        self.infile = infile
        self.regex = regex
        self.start = start
        self.end = end
        result = 0


        for num, line in enumerate(itertools.islice(infile,start,end)):
            if re.search(regex, line):
                result = num + start
                break




        return result

Then I take that list and find the end of each starting block by searching through the whole file. The following is my code for finding the end.

def FindNmlMsgEnd(self,logfile,NMsgBlockStart):

        self.logfile = logfile
        self.NMsgBlockStart = NMsgBlockStart

        NMsgBlockEnd = list()

        lf = LogFilter() 

        length = len(NMsgBlockStart)


        if length > 0:
            for i in range (0,length):
                start=NMsgBlockStart[i]                
                infile = lf.OpenFile(logfile, 'Input')
                lines = lf.LineCount(logfile, 'Input')
                end = lf.Lookup(infile, '}', start, lines+1)               
                NMsgBlockEnd.append(end)


            return NMsgBlockEnd
        else:
            print("There is no Normal Message blocks.")

But these methods are not efficient enough to handle huge files. The program runs for a long time without producing a result.

Is there an efficient way to do this?
If so, how could I do it?

I am writing other filters too, but first I need to find a solution for this basic problem. I am really new to Python. Please help me.

Answer 1

I see a couple of issues that are slowing your code down.

The first seems to be a pretty basic error. You're calling readlines on your file in the FindNMsgStart method, which reads the whole file into memory and returns a list of its lines.

You should just iterate over the lines directly by using enumerate(infile). You do this properly in the other functions that read the file, so I suspect this is a typo or just a simple oversight.
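As a minimal sketch of that fix, the start-line scan could look like the function below. The name find_msg_starts, the marker parameter, and the encoding settings are assumptions for illustration and are not part of the original LogFilter class. It also uses a plain substring test instead of re.search, since the marker is a fixed string.

```python
def find_msg_starts(path, marker="SMsg{"):
    """Return the 0-based numbers of all lines that contain marker."""
    starts = []
    # Assumption: the log is text that can be decoded as UTF-8;
    # errors="replace" keeps a stray bad byte from aborting the scan.
    with open(path, encoding="utf-8", errors="replace") as infile:
        # Iterating over the file object reads one line at a time,
        # so memory use stays flat even for multi-gigabyte files.
        for num, line in enumerate(infile):
            if marker in line:
                starts.append(num)
    return starts
```

Unlike readlines, this never holds more than one line of the file in memory at once.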

The second issue is a bit more complicated. It involves the general architecture of your search.

You're first scanning the file for message start lines, then searching for the end line after each start. Each end-line search requires re-reading much of the file, since you need to skip all the lines that occur before the start line. It would be a lot more efficient if you could combine both searches into a single pass over the data file.

Here's a really crude generator function that does that:

def find_message_bounds(filename):
    with open(filename) as f:
        iterator = enumerate(f)
        for start_line_no, start_line in iterator:
            if 'SMsg{' in start_line:
                for end_line_no, end_line in iterator:
                    if '}' in end_line:
                        yield start_line_no, end_line_no
                        break

This function yields (start, end) line number tuples, and only makes a single pass over the file.

I think you can actually implement a one-pass search using your Lookup method, if you're careful with the boundary variables you pass in to it.
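One thing the crude version does not cover is a block that opens and closes on the same line, because the inner loop only starts looking for "}" on the following line. If that can happen in your logs (an assumption, since the question does not say), a single-loop variant along these lines might work. The name iter_message_bounds and its parameters are hypothetical.

```python
def iter_message_bounds(lines, start_marker="SMsg{", end_marker="}"):
    """Yield (start_line_no, end_line_no) pairs from any iterable of lines."""
    start_no = None  # line number of the currently open block, if any
    for line_no, line in enumerate(lines):
        if start_no is None:
            pos = line.find(start_marker)
            if pos == -1:
                continue  # not inside a block and no block starts here
            start_no = line_no
            # Only text after the opening marker can close this block.
            rest = line[pos + len(start_marker):]
        else:
            rest = line
        if end_marker in rest:
            yield start_no, line_no
            start_no = None
```

It accepts any iterable of lines, so an open file object can be passed in directly, for example `with open(filename) as f: bounds = list(iter_message_bounds(f))`. This sketch still ignores nested braces and a second block that starts on the line where the previous one ended.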

def FindNmlMsgEnd(self,logfile,NMsgBlockStart):

    self.logfile = logfile
    self.NMsgBlockStart = NMsgBlockStart

    NMsgBlockEnd = list()

    lf = LogFilter() 
    infile = lf.OpenFile(logfile, 'Input')
    total_lines = lf.LineCount(logfile, 'Input')

    start = NMsgBlockStart[0]
    prev_end = -1
    # Each later start line bounds the search for the previous block's end.
    for next_start in NMsgBlockStart[1:]:
        end = lf.Lookup(infile, '}', start-prev_end-1, next_start-prev_end-1) + prev_end + 1
        NMsgBlockEnd.append(end)

        start = next_start
        prev_end = end

    last_end = lf.Lookup(infile, '}', start-prev_end-1, total_lines-prev_end-1) + prev_end + 1
    NMsgBlockEnd.append(last_end)

    return NMsgBlockEnd

It's possible I have an off-by-one error in there somewhere; the design of the Lookup function makes it difficult to call repeatedly.
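If you are free to change Lookup, one way to make repeated calls easier is to let a small object remember the absolute line number itself, so callers never have to do the offset arithmetic. The LineScanner class below is a hypothetical sketch of that idea, not part of the original code.

```python
class LineScanner:
    """Wrap an iterable of lines and track absolute line numbers."""

    def __init__(self, lines):
        self._it = iter(lines)
        self.next_line_no = 0  # absolute number of the next unread line

    def find(self, needle):
        """Consume lines until one contains needle.

        Returns that line's absolute 0-based number, or None if the
        input runs out first.
        """
        for line in self._it:
            line_no = self.next_line_no
            self.next_line_no += 1
            if needle in line:
                return line_no
        return None
```

With this, finding the bounds becomes alternating calls to `scanner.find("SMsg{")` and `scanner.find("}")` on a single scanner until the first one returns None, and every returned number is already absolute.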

Answer 1
2 2015-09-19 06:36:28
