What is the fastest way to save/load a large collection (list/set) of strings in Python 3.6?

Question

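The file is 5 GB in size.

I did find a similar question on Stack Overflow where people suggested using numpy arrays, but I suspect that solution works for collections of numbers rather than strings.

Is there anything that can beat eval(list.txt) or importing a Python file in which a variable is set to the list?

What is the most efficient way to load/save a Python list of strings?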

Answer 1

For the read-only case:

import numpy as np

class IndexedBlob:
    def __init__(self, filename):
        index_filename = filename + '.index'
        blob = np.memmap(filename, mode='r')

        try:
            # if there is an existing index
            indices = np.memmap(index_filename, dtype='>i8', mode='r')
        except FileNotFoundError:
            # else, create it
            indices, = np.where(blob == ord('\n'))
            # force a predictable dtype for the index file
            indices = np.array(indices, dtype='>i8')
            with open(index_filename, 'wb') as f:
                # add a virtual newline
                np.array(-1, dtype='>i8').tofile(f)
                indices.tofile(f)
            # then reopen it as a memmap to reduce memory use
            # (and also pick up that -1 we added)
            indices = np.memmap(index_filename, dtype='>i8', mode='r')

        self.blob = blob
        self.indices = indices

    def __getitem__(self, line):
        assert line >= 0

        lo = self.indices[line] + 1
        hi = self.indices[line + 1]

        return self.blob[lo:hi].tobytes().decode()

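Some additional notes:

Appending new strings at the end is easy: open the file in append mode and write a line (but beware of earlier bad writes), and remember to update the index file manually as well. Note, however, that you need to re-mmap if you want existing IndexedBlob objects to see the new data. You can avoid that by simply keeping a list of "loose" objects.
By design, if the last line is missing its newline, it is ignored (to detect truncation or concurrent writes).
You can shrink the index significantly by recording only every nth newline and then doing a linear search at lookup time. I found this was not worth it.
If you use separate indices for start and end, the strings no longer have to be stored in order, which opens up several possibilities for mutation. But if mutation is rare, rewriting the whole file and regenerating the index is not too expensive.
Consider using '\0' instead of '\n' as the separator.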
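
The first note (appending plus a manual index update) could look roughly like the sketch below. It is not part of the original answer: the helper name append_string is hypothetical, and it assumes the index file already exists in the layout written by IndexedBlob (big-endian 8-byte offsets of each newline, preceded by a -1 entry). It uses struct instead of numpy only to keep the sketch small.

```python
import os
import struct


def append_string(filename, text):
    """Append one string to the blob and record its newline offset in the index.

    Hypothetical helper; assumes filename + '.index' was already created
    (for example by constructing an IndexedBlob once).
    """
    index_filename = filename + '.index'
    if not os.path.exists(index_filename):
        raise FileNotFoundError('index file is missing; build it first')

    data = text.encode()
    if b'\n' in data:
        raise ValueError('string contains the newline delimiter')

    with open(filename, 'r+b') as blob:
        size = blob.seek(0, os.SEEK_END)
        if size:
            # Guard against an earlier bad write: the blob must end in '\n'.
            blob.seek(size - 1)
            if blob.read(1) != b'\n':
                raise RuntimeError('blob does not end with a newline')
        blob.seek(0, os.SEEK_END)
        blob.write(data + b'\n')

    # Offset of the newline just written, stored as '>i8' like the other entries.
    newline_offset = size + len(data)
    with open(index_filename, 'ab') as index:
        index.write(struct.pack('>q', newline_offset))
    return newline_offset
```

Any IndexedBlob object created before the call still maps the old file contents, so it has to be recreated to see the appended string.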

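And of course:

Whatever you do, general concurrent mutation is hard. If you need to do anything complex, use a real database: at that point it is the simplest solution.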
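
As one possible reading of "a real database", here is a minimal sketch using the standard-library sqlite3 module. The table name strings, its columns, and the helper names save_strings and load_string are assumptions made for illustration; the answer does not name a specific database. The lookup assumes the table started empty and no rows were deleted, so that row ids run 1, 2, 3, ...

```python
import sqlite3


def save_strings(path, strings):
    """Store the strings in insertion order (hypothetical schema)."""
    con = sqlite3.connect(path)
    try:
        with con:  # one transaction for the whole batch
            con.execute(
                'CREATE TABLE IF NOT EXISTS strings '
                '(id INTEGER PRIMARY KEY, value TEXT NOT NULL)'
            )
            con.executemany(
                'INSERT INTO strings (value) VALUES (?)',
                ((s,) for s in strings),
            )
    finally:
        con.close()


def load_string(path, line):
    """Return the string at zero-based position line."""
    con = sqlite3.connect(path)
    try:
        row = con.execute(
            'SELECT value FROM strings WHERE id = ?', (line + 1,)
        ).fetchone()
    finally:
        con.close()
    if row is None:
        raise IndexError(line)
    return row[0]
```

Unlike the memmap approach, this leaves locking and partial-write recovery to SQLite, at the cost of slower bulk loading.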

Solution 1
0 2018-07-05 01:15:41