
How do I re.search or re.match on a whole file without reading it all into memory?

I want to be able to run a regular expression on an entire file, but I'd like to not have to read the whole file into memory at once, as I may be working with rather large files in the future. Is there a way to do this? Thanks!

Clarification: I cannot read line by line because the match can span multiple lines.

You can use mmap to map the file to memory. The file contents can then be accessed like a normal string:

import re, mmap

with open('/var/log/error.log', 'r+') as f:
  data = mmap.mmap(f.fileno(), 0)
  mo = re.search('error: (.*)', data)
  if mo:
    print "found error", mo.group(1)

This also works for big files; the file content is loaded from disk as needed.
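If you are on Python 3, the same approach works, but the mapped data is bytes-like, so the pattern must be a bytes pattern; here is a minimal sketch under that assumption (the path is just the one from the example above):

import re, mmap

with open('/var/log/error.log', 'r+b') as f:
    data = mmap.mmap(f.fileno(), 0)
    # the mmap object is bytes-like, so use a bytes pattern
    mo = re.search(rb'error: (.*)', data)
    if mo:
        print("found error", mo.group(1).decode())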

This depends on the file and the regex. The best thing you could do would be to read the file line by line, but if that does not work for your situation then you might get stuck pulling the whole file into memory.

Let's say, for example, that this is your file:

Lorem ipsum dolor sit amet, consectetur
adipiscing elit. Ut fringilla pede blandit
eros sagittis viverra. Curabitur facilisis
urna ABC elementum lacus molestie aliquet.
Vestibulum lobortis semper risus. Etiam
sollicitudin. Vivamus posuere mauris eu
nulla. Nunc nisi. Curabitur fringilla fringilla
elit. Nullam feugiat, metus et suscipit
fermentum, mauris ipsum blandit purus,
non vehicula purus felis sit amet tortor.
Vestibulum odio. Mauris dapibus ultricies
metus. Cras XYZ eu lectus. Cras elit turpis,
ultrices nec, commodo eu, sodales non, erat.
Quisque accumsan, nunc nec porttitor vulputate,
erat dolor suscipit quam, a tristique justo
turpis at erat.

And this was your regex:

consectetur(?=\sadipiscing)

Now this regex uses positive lookahead and will only match the string "consectetur" if it is immediately followed by a whitespace character and then the string "adipiscing".

So in this example you would have to read the whole file into memory, because your regex depends on the entire file being parsed as a single string. This is one of many examples that would require you to have your entire string in memory for a particular regex to work.
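To make that concrete, here is a small sketch (the file name lorem.txt is hypothetical, standing for the sample text above) showing that the lookahead finds nothing when scanning line by line, but matches once the whole file is read as a single string:

import re

pattern = re.compile(r'consectetur(?=\sadipiscing)')

# line by line: "consectetur" and "adipiscing" sit on different lines,
# so the lookahead never succeeds within a single line
with open('lorem.txt') as f:
    print(any(pattern.search(line) for line in f))   # False

# whole file as one string: the newline satisfies \s and the match is found
with open('lorem.txt') as f:
    print(bool(pattern.search(f.read())))            # True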

I guess the unfortunate answer is that it all depends on your situation.

If this is a big deal and worth some effort, you can convert the regular expression into a finite state machine which reads the file. The FSM can be of O(n) complexity, which means it will be a lot faster as the file size gets big.

You will be able to efficiently match patterns that span lines in files too large to fit in memory.

Here are two places that describe the algorithm for converting a regular expression to a FSM:
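Purely as a rough sketch of the idea (a hand-written state machine for the toy pattern ab*c, fed one chunk at a time; a real solution would build the transition logic from the regex automatically):

def find_abc(path, chunk_size=64 * 1024):
    """Yield the byte offsets at which a match of 'ab*c' ends,
    without ever holding more than one chunk in memory."""
    state = 0          # 0 = start, 1 = saw 'a' followed by zero or more 'b'
    offset = 0
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            for byte in chunk:
                ch = chr(byte)
                if state == 0:
                    state = 1 if ch == 'a' else 0
                elif state == 1:
                    if ch == 'b':
                        state = 1      # 'b*' loops in place
                    elif ch == 'c':
                        yield offset   # a match of 'ab*c' ends here
                        state = 0
                    elif ch == 'a':
                        state = 1      # a fresh 'a' restarts the match
                    else:
                        state = 0
                offset += 1

Because the state is just an integer carried across chunk boundaries, matches that span chunks (or lines) are found without reading the whole file.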

This is one way:

import re

REGEX = r'\d+'

with open('/tmp/workfile', 'r') as f:
    for line in f:
        print re.match(REGEX, line)
  1. The with statement (in Python 2.5+) takes care of closing the file automatically, so you need not worry about it.
  2. Iterating over the file object is memory efficient; it won't read more than a line into memory at a given time.
  3. The drawback of this approach is that it would take a lot of time for huge files.

Another approach that comes to mind is to use the read(size) and file.seek(offset) methods, which read a portion of the file at a time.

import re, os

REGEX = r'\d+'

with open('/tmp/workfile', 'r') as f:
    filesize = os.path.getsize('/tmp/workfile')  # file objects have no size() method
    part = filesize / 10  # a suitable size that you can determine ahead of time or in the program
    position = 0
    while position <= filesize:
        content = f.read(part)
        print re.match(REGEX, content)
        position = position + part
        f.seek(position)

You can also combine these two: you can create a generator that returns the contents a certain number of bytes at a time and iterate through that content to check your regex. This, IMO, would be a good approach.
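A minimal sketch of that generator idea, assuming an overlap at least as long as the longest possible match is enough to avoid missing matches that straddle a chunk boundary (the chunk and overlap sizes are arbitrary, and matches falling entirely inside an overlap will be reported twice):

import re

def chunks(path, size=64 * 1024, overlap=1024):
    """Yield pieces of the file, each starting with the tail of the
    previous one so matches that straddle a boundary are still seen."""
    with open(path, 'rb') as f:
        tail = b''
        while True:
            block = f.read(size)
            if not block:
                break
            yield tail + block
            tail = block[-overlap:]

pattern = re.compile(rb'\d+')
for piece in chunks('/tmp/workfile'):
    for match in pattern.finditer(piece):
        print(match.group())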

Here's an option for you using re and mmap to find all the words in a file, without building lists or loading the whole file into memory.

import re
from contextlib import closing
from mmap import mmap, ACCESS_READ

with open('filepath.txt', 'rb') as f:
    with closing(mmap(f.fileno(), 0, access=ACCESS_READ)) as d:
        print(sum(1 for _ in re.finditer(rb'\w+', d)))

Based on @sth's answer, but with less memory usage:

import re

l = []
with open(filename, 'r') as f:
    for eachline in f:
        string = re.search("(<tr align=\"right\"><td>)([0-9]*)(</td><td>)([a-zA-Z]*)(</td><td>)([a-zA-Z]*)(</td>)", eachline)
        if string:
            for i in range(2, 8, 2):
                add = string.group(i)
                l.append(add)

For single-line patterns you can iterate over the lines of the file, but for multi-line patterns you will have to read all of the file (or part of it, though that is hard to keep track of) into memory.

Open the file and iterate over the lines.

import re

fd = open('myfile')
for line in fd:
    if re.match(..., line):
        print line

Python 3: To load the file as one big string, use the read() and decode() methods:

import re, mmap


def read_search_in_file(file='/var/log/error.log'):
    with open(file, 'r+') as f:
        data = mmap.mmap(f.fileno(), 0).read().decode("utf-8")
        error = re.search(r'error: (.*)', data)
        if error:
            return error.group(1)
