Python mmap replace (substitute) using regex

All you Python wizards,

I am mapping a huge file (up to 8GB) into memory using mmap, and I want to replace some strings in it using regular expressions, then save the result. How can I achieve that?

    >>> import mmap
    >>> import re
    >>> f = open('lorem.txt', 'r+')
    >>> m = mmap.mmap(f.fileno(), 0)
    >>> m.size()
    737

The issue I am having is that the replacement string is shorter than the text it replaces, so when I try to run the substitution, I get the error message "IndexError: mmap slice assignment is wrong size".

    >>> m[:] = re.sub('[Ll]orem', 'a', m[:])
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    IndexError: mmap slice assignment is wrong size

If I try writing the result back instead:

    >>> a = re.sub('[Ll]orem','a', m[:])
    >>> len(a)
    733
    >>> m.write(a)
    >>> m.flush(0,len(a))
    >>> m.size()
    737

As you can see, the mapped file m still has the same size (737 bytes), even though the substituted text is only 733 bytes long, so the file contents do not match the substituted text.

Any help is much appreciated. Thanks.

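It turns out that mmap() cannot be used to increase (or decrease) the size of a file. Its job is to map a portion of a file into memory, and the length of that mapping is fixed when it is created. The easiest fix is to write the shorter text into the map, close the map, and then truncate the file to the new size before closing it. Closing the map first matters because some platforms (notably Windows) refuse to truncate a file that still has an open mapping:

    >>> m.close()
    >>> f.truncate(len(a))
    >>> f.close()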
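
Put together, the shrink-and-truncate approach could look roughly like the sketch below. It assumes Python 3, where the contents of an mmap are bytes, so the pattern and the replacement must be bytes too. The function name replace_and_shrink is made up for this illustration. Note that m[:] copies the whole file into memory, so this is only a sketch of the technique, not something tuned for an 8GB file.

```python
import mmap
import re


def replace_and_shrink(path, pattern, repl):
    """Regex-replace inside a file through mmap, then cut the file to the new length.

    Hypothetical helper: assumes the substituted text is never longer than
    the original and that the file is not empty.
    """
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0) as m:
            new = re.sub(pattern, repl, m[:])  # m[:] is a full in-memory copy
            if len(new) > len(m):
                raise ValueError("result is longer than the mapped file")
            m[:len(new)] = new  # same-size slice assignment is allowed
            m.flush()
        # The map is closed here, so the file can be truncated safely.
        f.truncate(len(new))
    return len(new)
```

Calling replace_and_shrink('lorem.txt', b'[Ll]orem', b'a') would then leave a file that is exactly as long as the substituted text.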

If you expect the file to grow after the replacement, enlarge it before mapping it (for example, double its size). The extra space is padding, so remember to truncate the file to the real length of the new text once you are done:

    >>> import os
    >>> f = open('lorem.txt', 'r+')
    >>> f.truncate(os.path.getsize('lorem.txt') * 2)
    >>> m = mmap.mmap(f.fileno(), 0)
    >>> m.size()
    1474
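
A rough sketch of the grow-then-trim variant, again assuming Python 3 and bytes patterns. The name replace_with_headroom and its factor parameter are invented for this example, and doubling is just a guess at how much headroom is needed; if the result does not fit, the function restores the original size and raises.

```python
import mmap
import os
import re


def replace_with_headroom(path, pattern, repl, factor=2):
    """Pad the file, regex-replace through mmap, then trim to the real length.

    Hypothetical helper: `factor` is an assumed upper bound on how much
    the text may grow. Assumes a non-empty file.
    """
    original_size = os.path.getsize(path)
    final_size = original_size
    with open(path, "r+b") as f:
        f.truncate(original_size * factor)  # extend the file with padding
        try:
            with mmap.mmap(f.fileno(), 0) as m:
                # Only the first original_size bytes are real data.
                new = re.sub(pattern, repl, m[:original_size])
                if len(new) > len(m):
                    raise ValueError("result does not fit in the padded file")
                m[:len(new)] = new
                m.flush()
                final_size = len(new)
        finally:
            # Drop the unused padding (or undo it entirely on failure).
            f.truncate(final_size)
    return final_size
```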

You have to rewrite the file if you intend to replace a section with text of a different length than the original. At the very least, everything from the start of the replaced string to the end of the file has to be written again, because all the bytes after it shift position.
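
For a file of several gigabytes, one way to do such a rewrite without holding everything in memory is to stream it into a temporary file and swap that in at the end. This is a different technique from mmap, sketched here under two assumptions: Python 3, and a pattern that never matches across a line break. The name rewrite_with_regex is hypothetical.

```python
import os
import re
import tempfile


def rewrite_with_regex(path, pattern, repl):
    """Stream the file line by line into a temp file, then replace the original.

    Hypothetical helper: assumes bytes pattern/repl and that no match
    spans a newline.
    """
    regex = re.compile(pattern)
    # Create the temp file next to the original so os.replace stays on one filesystem.
    dir_name = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=dir_name)
    try:
        with os.fdopen(fd, "wb") as dst, open(path, "rb") as src:
            for line in src:
                dst.write(regex.sub(repl, line))
        os.replace(tmp_path, path)  # swap the rewritten file into place
    except BaseException:
        os.unlink(tmp_path)  # clean up the partial temp file on any failure
        raise
```

The new file is created by mkstemp, so it will not carry over the original file's permissions or ownership; copy them explicitly if that matters.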

Consider using collections of smaller files or another format that allows for variable lengths that can be interpreted by whatever process eventually reads that file.
