mmap in Python printing binary data instead of text
I am trying to read a big file of 30 MB character by character. I found an interesting article on how to read a big file: Fast Method to Stream Big files
Problem: The output is binary data instead of human-readable text.
Code:
import gzip
import mmap
import random

def getRow(filepath):
    offsets = get_offsets(filepath)
    random.shuffle(offsets)
    with gzip.open(filepath, "r+b") as f:
        i = 0
        # Map the file behind f.fileno() and read one record per offset
        mm = mmap.mmap(f.fileno(), 0, access = mmap.ACCESS_READ)
        for position in offsets:
            mm.seek(position)
            record = mm.readline()
            x = record.split(",")
            yield x

def get_offsets(input_filename):
    offsets = []
    with open(input_filename, 'r+b') as f:
        i = 0
        mm = mmap.mmap(f.fileno(), 0, access = mmap.ACCESS_READ)
        # Record the position reached after each line is read
        for record in iter(mm.readline, ''):
            loc = mm.tell()
            offsets.append(loc)
            i += 1
    return offsets

for line in getRow("hello.dat.gz"):
    print line
Output: The output is some weird binary data.
['w\xc1\xd9S\xabP8xy\x8f\xd8\xae\xe3\xd8b&\xb6"\xbeZ\xf3P\xdc\x19&H\\@\x8e\x83\x0b\x81?R\xb0\xf2\xb5\xc1\x88rJ\
Am I doing something terribly stupid?
EDIT:
I found the problem. It is because of gzip.open. Not sure how to get rid of this. Any ideas?
As per the documentation of GzipFile:
fileno(self)
Invoke the underlying file object's `fileno()` method.
You are mapping a view of the compressed .gz file, not a view of the decompressed data.
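A small sketch that makes this visible, written for Python 3 (the question uses Python 2). The file name demo.dat.gz and its contents are made up for illustration. The mapping obtained through fileno() starts with the gzip magic bytes, while reading through the gzip object returns the text that was written.

```python
import gzip
import mmap
import os
import tempfile

# Hypothetical sample file, created only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "demo.dat.gz")
with gzip.open(path, "wb") as f:
    f.write(b"a,b,c\n1,2,3\n")

with gzip.open(path, "rb") as f:
    # fileno() belongs to the underlying compressed file on disk.
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    mapped_head = bytes(mm[:2])   # first raw bytes of the .gz file
    mm.close()
    decoded_head = f.read(5)      # first bytes after decompression

print(mapped_head)    # gzip magic number, not text
print(decoded_head)   # the actual file content
```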
mmap() can only operate on OS file handles; it cannot map arbitrary Python file objects.
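As a rough illustration of that distinction (Python 3, not part of the original answer): an in-memory file object such as io.BytesIO has no OS-level descriptor at all, so there is nothing that mmap() could be given.

```python
import io

buf = io.BytesIO(b"a,b,c\n")   # a Python file object with no OS file handle

try:
    buf.fileno()
    has_descriptor = True
except io.UnsupportedOperation:
    # Raised because no file descriptor backs this object.
    has_descriptor = False

print(has_descriptor)
```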
So no, you cannot transparently map a decompressed view of a compressed file unless this is supported directly by the underlying operating system.
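One possible workaround, sketched under assumptions rather than taken from the answer: decompress the file once into a temporary uncompressed file, then map that file instead. The names decompress_to_temp and iter_shuffled_rows are hypothetical, the code targets Python 3, and it assumes UTF-8 text and enough disk space for the decompressed copy.

```python
import gzip
import mmap
import os
import random
import shutil
import tempfile


def decompress_to_temp(gz_path):
    """Write the decompressed contents of gz_path to a temp file; return its path."""
    fd, tmp_path = tempfile.mkstemp(suffix=".dat")
    with gzip.open(gz_path, "rb") as src, os.fdopen(fd, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return tmp_path


def iter_shuffled_rows(gz_path):
    """Yield the comma-separated rows of a gzip file in random order."""
    tmp_path = decompress_to_temp(gz_path)
    try:
        with open(tmp_path, "rb") as f:
            if os.fstat(f.fileno()).st_size == 0:
                return  # mmap cannot map an empty file
            mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
            try:
                # Record the offset at which every line starts.
                offsets = []
                pos = 0
                while pos < len(mm):
                    offsets.append(pos)
                    mm.seek(pos)
                    mm.readline()
                    pos = mm.tell()
                random.shuffle(offsets)
                for position in offsets:
                    mm.seek(position)
                    # Assumption: the data is UTF-8 encoded text.
                    line = mm.readline().decode("utf-8")
                    yield line.rstrip("\r\n").split(",")
            finally:
                mm.close()
    finally:
        os.remove(tmp_path)
```

With this in place, looping over iter_shuffled_rows("hello.dat.gz") would yield lists of text fields. For a file of only 30 MB, reading all lines through gzip.open into a list and shuffling that list is likely simpler still.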