What's the fastest way to save/load a large collection (list/set) of strings in Python 3.6?

The file is 5 GB.

I did find a similar question on Stack Overflow where people suggest using a numpy array, but I suppose that solution applies to a collection of numbers rather than strings.

Would there be anything that beats eval(list.txt), or importing a Python file with a variable set to the list?

What is the most efficient way to load/save a Python list of strings?

For the read-only case:

import numpy as np

class IndexedBlob:
    def __init__(self, filename):
        index_filename = filename + '.index'
        blob = np.memmap(filename, mode='r')

        try:
            # if there is an existing index
            indices = np.memmap(index_filename, dtype='>i8', mode='r')
        except FileNotFoundError:
            # else, create it
            indices, = np.where(blob == ord('\n'))
            # force the dtype so the index file has a predictable format
            indices = np.array(indices, dtype='>i8')
            with open(index_filename, 'wb') as f:
                # add a virtual newline
                np.array(-1, dtype='>i8').tofile(f)
                indices.tofile(f)
            # then reopen it as a memmap to reduce memory usage
            # (and also pick up that -1 we added)
            indices = np.memmap(index_filename, dtype='>i8', mode='r')

        self.blob = blob
        self.indices = indices

    def __getitem__(self, line):
        assert line >= 0

        lo = self.indices[line] + 1
        hi = self.indices[line + 1]

        return self.blob[lo:hi].tobytes().decode()
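
For example, a minimal usage sketch (the filename strings.txt is just a placeholder, not from the question; the file is assumed to be newline-separated UTF-8):

blob = IndexedBlob('strings.txt')
print(blob[0])                # first string
print(blob[1000])             # random access without reading the whole file
print(len(blob.indices) - 1)  # number of complete (newline-terminated) lines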

Some additional notes:

  • Adding new strings at the end is easy: simply open the file in append mode and write a line (but beware of earlier broken writes), and remember to manually update the index file too; a sketch of such a helper follows this list. Note that existing IndexedBlob objects will need to re-mmap before they can see the new entry; you could avoid that and simply keep a list of "loose" strings on the side.
  • By design, if the last line is missing a newline, it is ignored (to detect truncation or concurrent writing).
  • You could significantly shrink the size of the index by only storing every nth newline, then doing a linear search at lookup time. I found this not worth it, however.
  • If you use separate indices for the start and end of each string, you are no longer constrained to storing the strings in order, which opens up several possibilities for mutation. But if mutation is rare, rewriting the whole file and regenerating the index isn't too expensive.
  • Consider using '\0' as your separator instead of '\n'.
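
As a rough sketch of the append note above (append_string is a hypothetical helper name; it assumes a single writer and a blob file that already ends with a newline):

import os
import numpy as np

def append_string(filename, s):
    # Hypothetical helper: append one line to the blob and record the
    # position of its newline in the index file. Assumes a single writer
    # and that the existing blob already ends with '\n'.
    start = os.path.getsize(filename)
    data = s.encode() + b'\n'
    with open(filename, 'ab') as f:
        f.write(data)
    with open(filename + '.index', 'ab') as f:
        np.array(start + len(data) - 1, dtype='>i8').tofile(f)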

And of course:

  • General concurrent mutation is hard no matter what you do. If you need to do anything complicated, use a real database: it is the simplest solution at that point.
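
For completeness, a minimal sketch of that route using the standard library's sqlite3 module (the database name and table schema are examples, not part of the original answer):

import sqlite3

conn = sqlite3.connect('strings.db')
conn.execute('CREATE TABLE IF NOT EXISTS strings (id INTEGER PRIMARY KEY, value TEXT)')
with conn:  # commits automatically on success
    conn.executemany('INSERT INTO strings (value) VALUES (?)',
                     [('foo',), ('bar',)])
row = conn.execute('SELECT value FROM strings WHERE id = ?', (1,)).fetchone()
print(row[0])  # 'foo' (INTEGER PRIMARY KEY ids start at 1)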
