I am trying to read a big 30 MB file character by character. I found an interesting article on how to read a big file: Fast Method to Stream Big files.
Problem: the output prints binary data instead of human-readable text.
Code:
import gzip
import mmap
import random

def getRow(filepath):
    offsets = get_offsets(filepath)
    random.shuffle(offsets)
    with gzip.open(filepath, "r+b") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        for position in offsets:
            mm.seek(position)
            record = mm.readline()
            x = record.split(",")
            yield x

def get_offsets(input_filename):
    offsets = []
    with open(input_filename, 'r+b') as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        for record in iter(mm.readline, ''):
            loc = mm.tell()
            offsets.append(loc)
    return offsets

for line in getRow("hello.dat.gz"):
    print line
Output: the output is some weird binary data:
['w\xc1\xd9S\xabP8xy\x8f\xd8\xae\xe3\xd8b&\xb6"\xbeZ\xf3P\xdc\x19&H\\@\x8e\x83\x0b\x81?R\xb0\xf2\xb5\xc1\x88rJ\
Am I doing something terribly stupid?
EDIT:
I found the problem. It is because of gzip.open. Not sure how to get rid of this. Any ideas?
As per the documentation of GzipFile:

    fileno(self)
        Invoke the underlying file object's fileno() method.
You are mapping a view of the compressed .gz file, not a view of the decompressed data.
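You can see this directly: GzipFile.fileno() hands back the descriptor of the on-disk .gz file, so the mapping starts with the raw gzip header rather than your data. A small Python 3 sketch (the temp-file name is made up for illustration):

```python
import gzip
import mmap
import os
import tempfile

# Create a small gzip file to map.
path = os.path.join(tempfile.mkdtemp(), "hello.dat.gz")
with gzip.open(path, "wb") as f:
    f.write(b"plain,text\n")

# Map via the GzipFile's file descriptor, as the question's code does.
with gzip.open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    head = mm.read(2)
    mm.close()

# The mapping holds compressed bytes: it begins with the gzip magic number.
print(head == b"\x1f\x8b")
```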
mmap() can only operate on OS file handles; it cannot map arbitrary Python file objects.
So no, you cannot transparently map a decompressed view of a compressed file unless this is supported directly by the underlying operating system.
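One workaround, not taken from the answer but a sketch built on the standard library, is to drop mmap entirely and let GzipFile do the seeking itself: its tell() and seek() work on decompressed offsets (backward seeks re-read from the start, so this trades speed for correctness). A Python 3 version of the question's two functions under that approach:

```python
import gzip
import random

def get_offsets(path):
    """Record the decompressed offset at which each line starts."""
    offsets = []
    with gzip.open(path, "rb") as f:
        pos = f.tell()
        line = f.readline()
        while line:
            offsets.append(pos)
            pos = f.tell()
            line = f.readline()
    return offsets

def get_rows(path):
    """Yield the lines of a gzipped CSV-like file in random order."""
    offsets = get_offsets(path)
    random.shuffle(offsets)
    with gzip.open(path, "rb") as f:
        for pos in offsets:
            f.seek(pos)  # seeks within the *decompressed* stream
            yield f.readline().rstrip(b"\n").split(b",")
```

Note this also fixes a second bug in the original get_offsets: it recorded mm.tell() after each readline(), i.e. line ends rather than line starts.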