
The size parameter for gzip.open().read()

When working with the gzip library in Python, I often come across code that uses the .read() function in a pattern that looks like this:

with gzip.open(filename) as bytestream:
    bytestream.read(16) 
    buf = bytestream.read(
        IMAGE_SIZE * IMAGE_SIZE * num_images * NUM_CHANNELS
    )
    data = np.frombuffer(buf, dtype=np.uint8).astype(np.float32)

While I'm familiar with the context manager pattern, I struggle to grasp what the first line of code inside the with block is doing.

This is the documentation for the read() function:

Read at most n characters from stream.

Read from underlying buffer until we have n characters or we hit EOF. If n is negative or omitted, read until EOF.

If that is the case, the first line, bytestream.read(16), must be reading and thus skipping the first 16 characters, presumably because they act as metadata or a header. However, when I have some images, how would I know to use 16 as the argument for the read call instead of, say, 32, 8, or 64?

I recall often coming across code identical to the above, except that the author used bytestream.read(8) instead of bytestream.read(16) or, just as likely, some other value. Digging into the file character by character shows no discernible pattern that would determine the length of the header.

In other words, how does one determine the parameter to use in the read function call? Or how does one know the length of the header in a gzip-compressed file?

My guess was that it has something to do with the bytes, but after searching through the documentation and online references I can't confirm that.

Reproducible details

My hypothesis, after countless hours of troubleshooting, is that the first 16 characters represent some sort of header or metadata. So the first line in that code skips those 16 characters, and the remainder is stored in a variable named buf. However, digging into the data I found no way to determine why or how the value 16 was chosen. I have read the bytes in character by character, and also tried reading and casting them as np.float, but there are no discernible patterns suggesting that the metadata ends at the 16th character and the actual data begins at the 17th.

The following code reads the data from this website and extracts the first 30 characters. Notice that it is not apparent where the header "ends" (apparently at the 16th byte, after the second appearance of \x1c) and where the data begins:

import gzip
import numpy as np

train_data_filename = 'data_input/train-images-idx3-ubyte.gz'
IMAGE_SIZE = 28
NUM_CHANNELS = 1

def extract_data(filename, num_images):
    with gzip.open(filename) as bytestream:
        first30 = bytestream.read(30)
        return first30

first30 = extract_data(train_data_filename, 10)
print(first30)
# returns: b'\x00\x00\x08\x03\x00\x00\xea`\x00\x00\x00\x1c\x00\x00\x00\x1c\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'

If we modify the code to cast them as np.float32, so that all characters are numeric (float), there is again no apparent pattern to distinguish where the header / metadata ends and where the data begins.

Any reference or advice would be very much appreciated!

From gzip's perspective, everything it's returning to you is data. There is no metadata or gzip-specific header content prepended to that data stream, so there's no need for any kind of algorithm to figure out how much content gzip is prepending to that stream: the number of bytes it prepends is zero.
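
A quick round trip illustrates the point. This is only a sketch: the payload is made up and everything happens in memory, using io.BytesIO in place of a real file on disk.

```python
import gzip
import io

# Hypothetical payload standing in for the contents of an uncompressed file.
payload = bytes(range(32))

# Compress in memory, then read back through gzip.open() as the question does.
compressed = gzip.compress(payload)
with gzip.open(io.BytesIO(compressed)) as bytestream:
    restored = bytestream.read()

# gzip's own container header lives in `compressed`, not in what read() returns.
print(restored == payload)           # True
print(len(restored) - len(payload))  # 0
```

Whatever sits in the first bytes of the decompressed stream was therefore put there by whoever created the original file, not by gzip.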


Scroll down to the bottom of the page you linked; there's a header titled FILE FORMATS FOR THE MNIST DATABASE.

That format specification tells you exactly what the format is, and thus how many bytes are used for each header. Specifically, the first four items in each file are described as follows:

0000     32 bit integer  0x00000803(2051) magic number 
0004     32 bit integer  60000            number of images 
0008     32 bit integer  28               number of rows 
0012     32 bit integer  28               number of columns 

Thus, if you want to skip all four of those items, you would take 16 bytes off the top.
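
To see this in the bytes printed in the question, the first 16 of them can be decoded as four big-endian 32-bit integers. This is a minimal sketch using the standard struct module; the variable names are illustrative, and the byte string is copied from the question's output (the backtick there is the byte \x60).

```python
import struct

# First 16 bytes of the output shown in the question.
header = (
    b"\x00\x00\x08\x03"   # magic number
    b"\x00\x00\xea\x60"   # number of images
    b"\x00\x00\x00\x1c"   # number of rows
    b"\x00\x00\x00\x1c"   # number of columns
)

# ">" means big-endian (MSB first); each "I" is one unsigned 32-bit integer.
fields = struct.unpack(">IIII", header)
magic, num_images, rows, cols = fields

print(struct.calcsize(">IIII"))  # 16
print(fields)                    # (2051, 60000, 28, 28)
```

The zero bytes that follow in the question's output are the first pixels of the first image, which is why the boundary is not visible by eye.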

From the code snippet, bytestream.read(16) reads, and thereby skips, the first 16 bytes of bytestream. The documentation you quoted says that read() reads at most n characters from the stream; because gzip.open() opens the file in binary mode by default, each "character" here is a single byte, so 16 characters occupy 16 bytes.

See more on chars and bytes: https://pymotw.com/3/gzip/#reading-compressed-data
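
The difference between bytes and characters only shows up in text mode. The sketch below uses a made-up, in-memory string containing a non-ASCII character to compare the default binary mode with "rt" mode.

```python
import gzip
import io

# Illustrative text: "\u00e9" takes 2 bytes in UTF-8, the ASCII letters 1 each.
compressed = gzip.compress("\u00e9\u00e9\u00e9\u00e9abcd".encode("utf-8"))

# Default mode is binary ("rb"): read(4) returns at most 4 bytes.
with gzip.open(io.BytesIO(compressed)) as f:
    chunk_bytes = f.read(4)

# Text mode ("rt"): read(4) returns at most 4 decoded characters.
with gzip.open(io.BytesIO(compressed), "rt", encoding="utf-8") as f:
    chunk_text = f.read(4)

print(type(chunk_bytes).__name__, len(chunk_bytes))  # bytes 4
print(type(chunk_text).__name__, len(chunk_text))    # str 4
print(len(chunk_text.encode("utf-8")))               # 8
```

The snippet in the question uses the default binary mode, so the argument to read() counts bytes.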

The code snippet is primarily interested in the contents of buf, skipping the first 16 bytes of the stream. To understand how to determine the parameter that goes into the first bytestream.read() call, that is, how many bytes of the compressed image file to skip, we must understand what the rest of the code does: in particular, which file we are reading and what we are trying to accomplish with the numpy library (saving RGB images in a 1D numpy array?).

I am definitely not an expert on image processing, but it seems that bytestream.read(16) is a specific solution to the specific problem of processing one particular compressed image file. Thus, it is hard to tell how many bytes to skip without seeing more code and understanding more of the logic behind the snippet.
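
For MNIST-style files specifically, the header length can be read from the file itself. The sketch below assumes the IDX layout described in the file-format section of the MNIST page: a 4-byte magic number whose third byte is a data-type code and whose fourth byte is the number of dimensions, followed by one big-endian 32-bit size per dimension. The helper name idx_header_size and the tiny in-memory files are hypothetical, built only to mimic that layout.

```python
import gzip
import io
import struct

def idx_header_size(bytestream):
    """Return (header_size_in_bytes, dims) for an IDX stream positioned at byte 0."""
    # Magic number: two zero bytes, a data-type code, the number of dimensions.
    zero, dtype_code, ndim = struct.unpack(">HBB", bytestream.read(4))
    if zero != 0:
        raise ValueError("not an IDX file")
    # One big-endian unsigned 32-bit size per dimension follows.
    dims = struct.unpack(">" + "I" * ndim, bytestream.read(4 * ndim))
    return 4 + 4 * ndim, dims

# Synthetic stand-ins: an image file (magic 2051) and a label file (magic 2049).
images = gzip.compress(struct.pack(">IIII", 2051, 2, 28, 28) + bytes(2 * 28 * 28))
labels = gzip.compress(struct.pack(">II", 2049, 2) + bytes([5, 0]))

results = {}
for name, blob in [("images", images), ("labels", labels)]:
    with gzip.open(io.BytesIO(blob)) as bytestream:
        results[name] = idx_header_size(bytestream)
        print(name, results[name])
# images (16, (2, 28, 28))
# labels (8, (2,))
```

Under that assumption, read(16) fits a three-dimensional image file and read(8) fits a one-dimensional label file, which would account for the different values seen in otherwise identical code.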
