The file is 5 GB long.
I did find a similar question on Stack Overflow where people suggest using a NumPy array, but I suppose that solution applies to a collection of numbers, not strings.
Would there be anything that beats eval() on the file contents, or importing a Python file with a variable set to the list?
What is the most efficient way to load/save a Python list of strings?
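For context, the usual baselines are json and pickle; both avoid running eval() on a multi-gigabyte literal, and pickle is typically much faster to parse. A sketch (file paths and sample data are illustrative):

```python
import json
import os
import pickle
import tempfile

strings = ["alpha", "beta", "gamma"]  # illustrative data
path = os.path.join(tempfile.mkdtemp(), "list.json")

# JSON: portable and human-readable, but the whole list
# must fit in memory on both save and load.
with open(path, "w", encoding="utf-8") as f:
    json.dump(strings, f)
with open(path, encoding="utf-8") as f:
    loaded = json.load(f)
assert loaded == strings

# pickle: binary, usually far faster to parse than a huge
# Python literal fed to eval(); still loads everything at once.
pkl_path = path + ".pkl"
with open(pkl_path, "wb") as f:
    pickle.dump(strings, f, protocol=pickle.HIGHEST_PROTOCOL)
with open(pkl_path, "rb") as f:
    assert pickle.load(f) == strings
```

Both approaches still materialize the entire list in RAM, which is exactly what the answer below avoids.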
For the read-only case:

import numpy as np

class IndexedBlob:
    def __init__(self, filename):
        index_filename = filename + '.index'
        blob = np.memmap(filename, mode='r')
        try:
            # if there is an existing index
            indices = np.memmap(index_filename, dtype='>i8', mode='r')
        except FileNotFoundError:
            # else, create it
            indices, = np.where(blob == ord('\n'))
            # force a predictable dtype for the file
            indices = np.array(indices, dtype='>i8')
            with open(index_filename, 'wb') as f:
                # add a virtual newline before the first line
                np.array(-1, dtype='>i8').tofile(f)
                indices.tofile(f)
            # then reopen it as a memmap to reduce memory use
            # (and also pick up that -1 we added)
            indices = np.memmap(index_filename, dtype='>i8', mode='r')
        self.blob = blob
        self.indices = indices

    def __getitem__(self, line):
        assert line >= 0
        lo = self.indices[line] + 1
        hi = self.indices[line + 1]
        return self.blob[lo:hi].tobytes().decode()
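A self-contained usage sketch (it repeats the class so it runs standalone; the sample file name and contents are illustrative):

```python
import os
import tempfile

import numpy as np

class IndexedBlob:
    def __init__(self, filename):
        index_filename = filename + '.index'
        blob = np.memmap(filename, mode='r')
        try:
            # reuse an existing index if present
            indices = np.memmap(index_filename, dtype='>i8', mode='r')
        except FileNotFoundError:
            # otherwise build it from the newline positions
            indices, = np.where(blob == ord('\n'))
            indices = np.array(indices, dtype='>i8')
            with open(index_filename, 'wb') as f:
                np.array(-1, dtype='>i8').tofile(f)  # virtual newline at -1
                indices.tofile(f)
            indices = np.memmap(index_filename, dtype='>i8', mode='r')
        self.blob = blob
        self.indices = indices

    def __getitem__(self, line):
        assert line >= 0
        lo = self.indices[line] + 1
        hi = self.indices[line + 1]
        return self.blob[lo:hi].tobytes().decode()

# write a small newline-terminated sample file
path = os.path.join(tempfile.mkdtemp(), "lines.txt")
with open(path, "wb") as f:
    f.write(b"first line\nsecond line\nthird line\n")

blob = IndexedBlob(path)
assert blob[0] == "first line"
assert blob[2] == "third line"
# the index file is created on first use and reused afterwards
assert os.path.exists(path + ".index")
```

Note that only the bytes of the requested line are decoded; the rest of the file stays on disk until touched.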
Some additional notes:

- Because the data is accessed through mmap, only the pages you actually touch are read into memory; the 5 GB file is never loaded wholesale.
- Appending new strings means extending both the blob and the index if you want to see it for existing IndexedBlob objects. You could avoid that and simply keep a list of "loose" objects.
- You could shrink the index by storing only every nth newline, then doing a linear search at lookup time. I found this not worth it, however.
- If your strings can contain embedded newlines, use '\0' as your separator instead of '\n'.
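For the save side, a small helper (save_blob is my name for it, not part of the answer) can write the newline-terminated format the reader above expects, assuming no string contains an embedded newline:

```python
import os
import tempfile

def save_blob(filename, strings):
    # One string per line, newline-terminated, matching the
    # reader's format. Assumes no string contains '\n';
    # otherwise switch both sides to a '\0' separator.
    with open(filename, 'wb') as f:
        for s in strings:
            f.write(s.encode() + b'\n')

path = os.path.join(tempfile.mkdtemp(), "out.txt")
save_blob(path, ["a", "bb", "ccc"])
with open(path, "rb") as f:
    assert f.read() == b"a\nbb\nccc\n"
```

Saving is sequential and streaming, so it never needs more memory than one string at a time.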