
In Python, how do I decode GZIP encoding?

I downloaded a web page in my Python script. In most cases, this works fine.

However, this one had a response header indicating GZIP encoding, and when I tried to print the source code of the page, my PuTTY terminal showed nothing but symbols.

How do I decode this to regular text?

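I use zlib to decompress gzipped content from the web.

import zlib
import urllib.request

# url is the address of the gzip-encoded page
f = urllib.request.urlopen(url)
# 16 + zlib.MAX_WBITS tells zlib to expect a gzip header and trailer
decompressed_data = zlib.decompress(f.read(), 16 + zlib.MAX_WBITS)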
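To try that call without a network connection, here is a minimal sketch that gzips a sample payload locally and then decompresses it the same way. The sample bytes and variable names are made up for illustration; a real response body from f.read() would take the place of gzipped.

```python
import gzip
import zlib

# Hypothetical payload standing in for a downloaded page.
original = b"<html><body>Hello, gzip!</body></html>"

# gzip.compress produces the gzip (RFC 1952) framing that a server
# sends with "Content-Encoding: gzip".
gzipped = gzip.compress(original)

# 16 + zlib.MAX_WBITS (that is, 31) tells zlib to expect a gzip header and trailer.
decompressed = zlib.decompress(gzipped, 16 + zlib.MAX_WBITS)

print(decompressed.decode("utf-8"))
```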

Decompress your byte stream using the built-in gzip module.

If you have any problems, do show the exact minimal code that you used, the exact error message and traceback, together with the result of print repr(your_byte_stream[:100]) (in Python 3, print(repr(your_byte_stream[:100]))).
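As a rough sketch of what that looks like for a body held in memory, the gzip module can read from an io.BytesIO wrapper, and the first two bytes show whether the data is gzip at all (gzip streams start with the bytes 1f 8b). The helper name gunzip_bytes and the sample data are assumptions for illustration, and the snippet is written for Python 3.

```python
import gzip
import io

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream (RFC 1952)

def gunzip_bytes(byte_stream):
    """Decompress gzip data held in memory (hypothetical helper)."""
    with gzip.GzipFile(fileobj=io.BytesIO(byte_stream)) as gz:
        return gz.read()

# Sample data standing in for a downloaded response body.
your_byte_stream = gzip.compress(b"regular text " * 20)

# The diagnostic the answer asks for: look at the leading bytes.
print(repr(your_byte_stream[:100]))

if your_byte_stream[:2] == GZIP_MAGIC:
    text = gunzip_bytes(your_byte_stream).decode("utf-8")
    print(text[:26])
```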

Further information

1. For an explanation of the gzip/zlib/deflate confusion, read the "Other uses" section of this Wikipedia article.

2. It can be easier to use the zlib module than the gzip module if you have a string rather than a file. Unfortunately the Python docs are incomplete/wrong:

zlib.decompress(string[, wbits[, bufsize]])

...The absolute value of wbits is the base two logarithm of the size of the history buffer (the "window size") used when compressing data. Its absolute value should be between 8 and 15 for the most recent versions of the zlib library, larger values resulting in better compression at the expense of greater memory usage. The default value is 15. When wbits is negative, the standard gzip header is suppressed; this is an undocumented feature of the zlib library, used for compatibility with unzip's compression file format.

Firstly, 8 <= log2_window_size <= 15, with the meaning given above. Then what should be a separate arg is kludged on top:

arg == log2_window_size means assume string is in zlib format (RFC 1950; what the HTTP 1.1 RFC 2616 confusingly calls "deflate").

arg == -log2_window_size means assume string is in deflate format (RFC 1951; what people who didn't read the HTTP 1.1 RFC carefully actually implemented).

arg == 16 + log2_window_size means assume string is in gzip format (RFC 1952). So you can use 31.

The above information is documented in the zlib C library manual; Ctrl-F search for windowBits.
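The three cases can be checked with a short experiment. This is a sketch rather than anything from the answer itself: the payload and the compress_with helper are invented for illustration, and the compressed samples are produced locally with zlib.compressobj so that nothing needs to be downloaded.

```python
import zlib

payload = b"the same bytes, three different wrappers " * 10

def compress_with(wbits):
    """Compress payload using the given wbits value (illustrative helper)."""
    c = zlib.compressobj(zlib.Z_DEFAULT_COMPRESSION, zlib.DEFLATED, wbits)
    return c.compress(payload) + c.flush()

zlib_data = compress_with(15)        # zlib format, RFC 1950
raw_deflate = compress_with(-15)     # raw deflate, RFC 1951
gzip_data = compress_with(16 + 15)   # gzip format, RFC 1952

# Each format decompresses with the matching wbits value.
assert zlib.decompress(zlib_data, 15) == payload
assert zlib.decompress(raw_deflate, -15) == payload
assert zlib.decompress(gzip_data, 31) == payload

# A mismatched value fails: gzip data is not valid zlib-format data.
try:
    zlib.decompress(gzip_data, 15)
except zlib.error as exc:
    print("wrong wbits:", exc)
```

The zlib manual also describes adding 32 to windowBits to auto-detect a zlib or gzip header, which is not covered in the answer above; zlib.decompress(data, 32 + 15) is one way to try that when the server's format is uncertain.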

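For Python 3

Try this:

import gzip

# opener and request are created elsewhere, e.g. with
# urllib.request.build_opener() and urllib.request.Request(url)
fetch = opener.open(request)  # basically get a response object
data = gzip.decompress(fetch.read())  # decompressed bytes
data = str(data, 'utf-8')  # decode the bytes to text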
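Since not every response is compressed, it can help to check the Content-Encoding header before calling gzip.decompress. The sketch below uses two hypothetical helpers, decode_body and fetch_text, and keeps the network call separate so the decoding step can be tried on its own. The Accept-Encoding header and the UTF-8 fallback are assumptions, not part of the answer.

```python
import gzip
import urllib.request

def decode_body(raw, content_encoding, charset="utf-8"):
    """Return text from a raw body, gunzipping it only if the header says so."""
    if (content_encoding or "").strip().lower() == "gzip":
        raw = gzip.decompress(raw)
    return raw.decode(charset)

def fetch_text(url):
    """Fetch url and return its text (hypothetical helper, needs network access)."""
    request = urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})
    with urllib.request.urlopen(request) as response:
        encoding = response.headers.get("Content-Encoding")
        charset = response.headers.get_content_charset() or "utf-8"
        return decode_body(response.read(), encoding, charset)

# Offline check of the decoding step with locally compressed data.
sample = gzip.compress(b"hello, world")
print(decode_body(sample, "gzip"))
print(decode_body(b"plain body", None))
```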

I use something like this (Python 2):

import urllib2
from cStringIO import StringIO
from gzip import GzipFile

def fetch(request):
    f = urllib2.urlopen(request)
    data = f.read()
    try:
        # wrap the raw bytes in a file-like object so GzipFile can read them
        data = GzipFile(fileobj=StringIO(data)).read()
    except IOError:
        # not gzip data; return the body unchanged
        pass
    return data

If you use the Requests module, then you don't need to use any other modules because the gzip and deflate transfer-encodings are automatically decoded for you.

Example:

>>> import requests
>>> custom_header = {'Accept-Encoding': 'gzip'}
>>> response = requests.get('https://api.github.com/events', headers=custom_header)
>>> response.headers
{'Content-Encoding': 'gzip',...}
>>> response.text
'[{"id":"9134429130","type":"IssuesEvent","actor":{"id":3287933,...

The .text property of the response is for reading the content in the text context.

The .content property of the response is for reading the content in the binary context.

See the Binary Response Content section on docs.python-requests.org.

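Similar to Shatu's answer for Python 3, but arranged a little differently:

import gzip
import json
from urllib.request import Request, urlopen

# headers is a dict of request headers defined elsewhere
s = Request("https://someplace.com", None, headers)
r = urlopen(s, None, 180).read()  # 180-second timeout
try:
    r = gzip.decompress(r)
except OSError:
    pass  # body was not gzip-compressed; use it as-is
result = json.loads(r.decode())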

This method wraps gzip.decompress() in a try/except that catches and ignores the OSError raised when the data turns out not to be compressed, which matters when a server returns a mix of compressed and uncompressed responses. Some small strings actually get bigger if they are encoded, so the plain data is sent instead.
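The same fallback can be packaged as a small function. This is only a sketch: the name maybe_gunzip is made up, and it assumes that an OSError from gzip.decompress means the body was sent uncompressed.

```python
import gzip

def maybe_gunzip(data):
    """Return decompressed bytes if data is gzip, otherwise data unchanged."""
    try:
        return gzip.decompress(data)
    except OSError:
        # Raised when the bytes do not start with a gzip header.
        return data

compressed = gzip.compress(b'{"ok": true}')
plain = b'{"ok": true}'

print(maybe_gunzip(compressed))
print(maybe_gunzip(plain))
```

Both calls should yield the same bytes, so the caller can decode the result without knowing which form the server sent.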

This version is simple and avoids reading the whole file first by not calling the read() method. It provides a file-like stream object instead, which behaves just like a normal file stream.

import gzip
from urllib.request import urlopen

my_gzip_url = 'http://my_url.gz'
my_gzip_stream = urlopen(my_gzip_url)
my_stream = gzip.open(my_gzip_stream, 'r')
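To illustrate what a file-like stream allows, the sketch below iterates over decompressed lines one at a time. An io.BytesIO object stands in for the urlopen response so the example runs offline; with a real response the loop would look the same. The sample data and variable names are assumptions.

```python
import gzip
import io

# Stand-in for the object returned by urlopen(): any binary file-like object.
fake_response = io.BytesIO(
    gzip.compress(b"first line\nsecond line\nthird line\n")
)

lines = []
# "rt" asks gzip.open for text mode, so each line arrives as a decoded str
# and the data is decompressed as it is read rather than all at once.
with gzip.open(fake_response, "rt", encoding="utf-8") as stream:
    for line in stream:
        lines.append(line.rstrip("\n"))

print(lines)
```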

None of these answers worked out of the box using Python 3. Here is what worked for me to fetch a page and decode the gzipped response:

import requests
import gzip

response = requests.get('your-url-here')
data = str(gzip.decompress(response.content), 'utf-8')
print(data)  # decoded contents of page

You can easily decode gzip with urllib3.

urllib3.response.decode_gzip(response.data)


 