Python mmap replace (substitute) using regex

All you Python wizards,

I am reading a huge file (up to 8GB) into memory using mmap, and I want to replace some strings using regular expressions and then save the result. How can I achieve that?

    >>> import mmap
    >>> import re
    >>> f = open('lorem.txt', 'r+')
    >>> m = mmap.mmap(f.fileno(), 0)
    >>> m.size()
    737

The issue I am having is that the replacement string is shorter than the text it replaces, so when I try to run the substitution, I get the error message IndexError: mmap slice assignment is wrong size.

    >>> m[:] = re.sub('[Ll]orem', 'a', m[:])
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    IndexError: mmap slice assignment is wrong size

If I try this instead:

    >>> a = re.sub('[Ll]orem','a', m[:])
    >>> len(a)
    733
    >>> m.write(a)
    >>> m.flush(0,len(a))
    >>> m.size()
    737

As you can see, the mapped file m still has the same size, which means its length no longer matches the substituted text!

Any help is much appreciated. Thanks.

It turns out that writing through mmap() cannot increase (or decrease) the size of a file. mmap()'s job is to map a portion of a file into memory. The easiest way is to truncate the file to the new size before closing it:

    >>> f.truncate(len(a))
    >>> f.close()
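
For a file too large to copy with m[:], the same truncate idea can be sketched without building the whole result in memory. This is only a sketch under some assumptions: `shrink_replace` is a made-up helper name, the replacement is a literal byte string (no backreferences), and it is never longer than any match. It slides the unchanged bytes toward the front with `mmap.move()` and truncates the file after the mapping is closed.

```python
import mmap
import re


def shrink_replace(path, pattern, replacement):
    """Hypothetical helper: replace regex matches in place, then truncate.

    Assumes `pattern` and `replacement` are bytes and that the replacement
    is never longer than a match, so data only ever moves toward the front.
    Returns the new file size.
    """
    regex = re.compile(pattern)
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0) as m:
            # Collect the spans first so the scan is not disturbed by writes.
            spans = [match.span() for match in regex.finditer(m)]
            if any(end - start < len(replacement) for start, end in spans):
                raise ValueError("replacement is longer than a match")

            read_pos = write_pos = 0
            for start, end in spans:
                gap = start - read_pos
                if gap:
                    # Slide the untouched bytes before this match forward.
                    m.move(write_pos, read_pos, gap)
                write_pos += gap
                m[write_pos:write_pos + len(replacement)] = replacement
                write_pos += len(replacement)
                read_pos = end

            # Slide whatever follows the last match.
            tail = len(m) - read_pos
            if tail:
                m.move(write_pos, read_pos, tail)
            write_pos += tail
            m.flush()
        # The mapping is closed here, so it is safe to cut off the leftovers.
        f.truncate(write_pos)
    return write_pos
```

With the example from the question this would be called as `shrink_replace('lorem.txt', rb'[Ll]orem', b'a')`. Note the bytes pattern: a mapped file is binary data.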

If you expect the file to grow after the replacement, increase its size (for example, double it) right after opening it:

    >>> import os
    >>> f = open('lorem.txt', 'r+')
    >>> f.truncate(os.path.getsize('lorem.txt') * 2)
    >>> m = mmap.mmap(f.fileno(), 0)
    >>> m.size()
    1474
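
Instead of guessing a size such as double, the exact final size can be computed from the matches before the file is extended. The sketch below is one possible way to do that, not the only one: `grow_replace` is a made-up name, the replacement is a literal byte string, and it is assumed to be at least as long as every match. It works from the end of the file backwards so no unread data gets overwritten.

```python
import mmap
import os
import re


def grow_replace(path, pattern, replacement):
    """Hypothetical helper: replace matches in place when the file grows.

    Assumes `pattern` and `replacement` are bytes and that the replacement
    is at least as long as every match. Returns the new file size.
    """
    regex = re.compile(pattern)
    with open(path, "r+b") as f:
        old_size = os.fstat(f.fileno()).st_size
        # First pass: a read-only mapping, just to locate the matches.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            spans = [match.span() for match in regex.finditer(m)]
        if any(end - start > len(replacement) for start, end in spans):
            raise ValueError("replacement is shorter than a match")

        growth = sum(len(replacement) - (end - start) for start, end in spans)
        new_size = old_size + growth
        f.truncate(new_size)  # extend the file before mapping it again

        with mmap.mmap(f.fileno(), 0) as m:
            read_end, write_end = old_size, new_size
            # Walk the matches from last to first, shifting data toward the end.
            for start, end in reversed(spans):
                count = read_end - end
                write_end -= count
                if count:
                    m.move(write_end, end, count)
                write_end -= len(replacement)
                m[write_end:write_end + len(replacement)] = replacement
                read_end = start
            m.flush()
    return new_size
```

The file is mapped twice on purpose: the mapping length is fixed when it is created, so the second mapping is made only after `truncate()` has extended the file.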

You have to rewrite the file if you intend to replace a section with text of a different length, at least from the start of the first replaced string to the end of the file.
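
A rough sketch of such a rewrite, for what it's worth: map the original read-only, stream the pieces between matches into a temporary file, and swap it in at the end. The name `rewrite_with_replacement` is made up for illustration, the replacement is a literal byte string, and each gap between matches is copied as one bytes object, so a very large gap would still need to be written in smaller chunks.

```python
import mmap
import os
import re
import tempfile


def rewrite_with_replacement(path, pattern, replacement):
    """Hypothetical helper: write a substituted copy, then replace the file.

    Works for replacements of any length. `pattern` and `replacement`
    are assumed to be bytes.
    """
    regex = re.compile(pattern)
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives next to the original so os.replace() stays
    # on one filesystem. Note that mkstemp() uses restrictive permissions.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as out, open(path, "rb") as src:
            with mmap.mmap(src.fileno(), 0, access=mmap.ACCESS_READ) as m:
                pos = 0
                for match in regex.finditer(m):
                    out.write(m[pos:match.start()])  # unchanged bytes
                    out.write(replacement)
                    pos = match.end()
                out.write(m[pos:])  # everything after the last match
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the partial copy and leave the original untouched.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

This needs free disk space for a second copy of the file, but the original stays intact until the new one is complete.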

Consider using a collection of smaller files, or another format that allows variable-length fields and can be interpreted by whatever process eventually reads the file.
