How to decode gzip-compressed page source in Python

Question

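I am trying to fetch the source of a PHP web page through a proxy, but the result contains non-printable characters. The output I get is as follows: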

"Date: Tue, 09 Feb 2016 10:29:14 GMT
Server: Apache/2.4.9 (Unix) OpenSSL/1.0.1g PHP/5.5.11 mod_perl/2.0.8-dev Perl/v5.16.3
X-Powered-By: PHP/5.5.11
Set-Cookie: PHPSESSID=jmqasueos33vqoe6dbm3iscvg0; path=/
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
Content-Encoding: gzip
Vary: Accept-Encoding
Content-Length: 577
Keep-Alive: timeout=5, max=99
Connection: Keep-Alive
Content-Type: text/html

�TMo�@�G����7�)P�H�H�DS��=U�=�U�]˻��_�Ycl�T�*�>��eg��
                                                          ����Z�
                                                                �V�N�f�:6�ԫ�IkZ77�A��nG�W��ɗ���RGY��Oc`-ο�ƜO��~?�V��$�
                            �l4�+���n�].W��TǇSx�/|�n��#���>��r����;�l����H��4��f�\  �SY�y��7��"

How can I decode this in Python? I tried

decd = zlib.decompress(data, 16 + zlib.MAX_WBITS)

but it did not return the decoded data.

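The proxy I am using works fine for some other web applications. For certain web applications it shows non-printable characters; how do I decode them?

Because I am going through a proxy server, I do not want to use get(), urlopen(), or any other request issued from the Python program.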

Answer 1

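One obvious approach is to extract the compressed data from the response and decompress it with GzipFile().read(). Splitting the response this way can be fragile, but here it is: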

from gzip import GzipFile
from StringIO import StringIO

http = 'HTTP/1.1 200 OK\r\nServer: nginx\r\nDate: Tue, 09 Feb 2016 12:02:25 GMT\r\nContent-Type: application/json\r\nContent-Length: 115\r\nConnection: close\r\nContent-Encoding: gzip\r\nAccess-Control-Allow-Origin: *\r\nAccess-Control-Allow-Credentials: true\r\n\r\n\x1f\x8b\x08\x00\xa0\xda\xb9V\x02\xff\xab\xe6RPPJ\xaf\xca,(HMQ\xb2R()*M\xd5Q\x00\x89e\xa4&\xa6\xa4\x16\x15\x03\xc5\xaa\x81\\\xa0\x80G~q\t\x90\xa7\x94QRR\x90\x94\x99\xa7\x97_\x94\xae\x04\x94\xa9\x85(\xcfM-\xc9\xc8\x07\x99\xa0\xe4\xee\x1a\xa2\x04\x11\xcb/\xcaL\xcf\xcc\x03\x89\x19Z\x1a\xe9\x19\x9aY\xe8\x19\xea\x19*q\xd5r\x01\x00\r(\xafRu\x00\x00\x00'

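# Everything after the first blank line (\r\n\r\n) is the gzip-compressed body.
body = http.split('\r\n\r\n', 1)[1]
# Python 2: wrap the body in a file-like object so GzipFile can read it.
print GzipFile(fileobj=StringIO(body)).read()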

Output:

{
  "gzipped": true, 
  "headers": {
    "Host": "httpbin.org"
  }, 
  "method": "GET", 
  "origin": "192.168.1.1"
}
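
On Python 3 the response has to be handled as bytes, and gzip.decompress can replace the GzipFile/StringIO pair. The following is a minimal sketch of the same split-and-decompress idea, not part of the original answer. The function name decode_raw_response and the sample response are illustrative assumptions, and the sketch assumes the whole response is already in memory and is not sent with chunked transfer encoding.

```python
import gzip

def decode_raw_response(raw):
    """Split a raw HTTP response (bytes) into headers and body, and
    gunzip the body only if the headers say Content-Encoding: gzip."""
    head, _, body = raw.partition(b"\r\n\r\n")
    headers = {}
    for line in head.split(b"\r\n")[1:]:  # skip the status line
        name, _, value = line.partition(b":")
        headers[name.strip().lower()] = value.strip().lower()
    if headers.get(b"content-encoding") == b"gzip":
        body = gzip.decompress(body)
    return body

# Hypothetical sample response, built here only for the demonstration.
payload = b'{"gzipped": true}'
compressed = gzip.compress(payload)
raw = (b"HTTP/1.1 200 OK\r\n"
       b"Content-Encoding: gzip\r\n"
       b"Content-Length: " + str(len(compressed)).encode("ascii") + b"\r\n"
       b"\r\n") + compressed

decoded = decode_raw_response(raw)
print(decoded.decode("utf-8"))
```

Checking the Content-Encoding header before decompressing means that uncompressed responses seen through the same proxy pass through unchanged.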

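If you feel compelled to parse the full HTTP response message, then, inspired by this answer, here is a rather roundabout way of doing it: construct an httplib.HTTPResponse directly from the raw HTTP response, use it to create a urllib3.response.HTTPResponse, and then access the decompressed data: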

import httplib
from cStringIO import StringIO
from urllib3.response import HTTPResponse

http = 'HTTP/1.1 200 OK\r\nServer: nginx\r\nDate: Tue, 09 Feb 2016 12:02:25 GMT\r\nContent-Type: application/json\r\nContent-Length: 115\r\nConnection: close\r\nContent-Encoding: gzip\r\nAccess-Control-Allow-Origin: *\r\nAccess-Control-Allow-Credentials: true\r\n\r\n\x1f\x8b\x08\x00\xa0\xda\xb9V\x02\xff\xab\xe6RPPJ\xaf\xca,(HMQ\xb2R()*M\xd5Q\x00\x89e\xa4&\xa6\xa4\x16\x15\x03\xc5\xaa\x81\\\xa0\x80G~q\t\x90\xa7\x94QRR\x90\x94\x99\xa7\x97_\x94\xae\x04\x94\xa9\x85(\xcfM-\xc9\xc8\x07\x99\xa0\xe4\xee\x1a\xa2\x04\x11\xcb/\xcaL\xcf\xcc\x03\x89\x19Z\x1a\xe9\x19\x9aY\xe8\x19\xea\x19*q\xd5r\x01\x00\r(\xafRu\x00\x00\x00'

class DummySocket(object):
    def __init__(self, data):
        self._data = StringIO(data)
    def makefile(self, *args, **kwargs):
        return self._data

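# Parse the status line and headers from the fake socket.
response = httplib.HTTPResponse(DummySocket(http))
response.begin()
# Wrap it in a urllib3 response; its .data is decoded per Content-Encoding.
response = HTTPResponse.from_httplib(response)
print(response.data)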

Output:

{
  "gzipped": true, 
  "headers": {
    "Host": "httpbin.org"
  }, 
  "method": "GET", 
  "origin": "192.168.1.1"
}
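
A similar fake-socket trick can be sketched on Python 3 with the standard library alone, where httplib is named http.client. This is a hedged illustration rather than the original answer's method: it decompresses the body by hand instead of going through urllib3, the helper name parse_and_decode and the sample response are assumptions, and it relies on HTTPResponse calling only makefile() on the object it is given.

```python
import gzip
import http.client
import io

class DummySocket:
    """Stands in for a socket; HTTPResponse only calls makefile() on it."""
    def __init__(self, data):
        self._file = io.BytesIO(data)

    def makefile(self, *args, **kwargs):
        return self._file

def parse_and_decode(raw):
    """Parse a raw HTTP response (bytes) and return (status, decoded body)."""
    response = http.client.HTTPResponse(DummySocket(raw))
    response.begin()        # parses the status line and the headers
    body = response.read()  # body exactly as sent, still compressed
    encoding = (response.getheader("Content-Encoding") or "").lower()
    if encoding == "gzip":
        body = gzip.decompress(body)
    return response.status, body

# Hypothetical sample response, built here only for the demonstration.
payload = b'{"gzipped": true}'
compressed = gzip.compress(payload)
raw = (b"HTTP/1.1 200 OK\r\n"
       b"Content-Encoding: gzip\r\n"
       b"Content-Length: " + str(len(compressed)).encode("ascii") + b"\r\n"
       b"Connection: close\r\n"
       b"\r\n") + compressed

status, body = parse_and_decode(raw)
print(status, body.decode("utf-8"))
```

Letting http.client do the header parsing avoids hand-written splitting, at the cost of depending on how HTTPResponse uses its socket argument.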

Answer 2

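Although gzip uses zlib, when Content-Encoding is set to gzip there is an additional header in front of the compressed stream, which zlib.decompress does not interpret correctly with its default wbits.

Put the data in a file-like object and pass it through the gzip module. For example: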

import cStringIO
import gzip

mydatafile = cStringIO.StringIO(data)  # data: the gzip-compressed body only
gzipper = gzip.GzipFile(fileobj=mydatafile)
decdata = gzipper.read()

This is taken from my now rather old HTTP library for Python 2.x:

https://github.com/mementum/httxlib/blob/master/httxlib/httxcompression.py
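
As a hedged aside, zlib itself can read gzip framing when it is given a suitable wbits value, as long as the input begins at the gzip magic bytes and not at the HTTP headers. The sketch below is written for Python 3 with made-up sample data. It shows the two documented wbits modes and the failure that occurs when the headers are passed along with the body, which may be what happened in the question.

```python
import gzip
import zlib

# Made-up sample body, compressed the way a server would for
# Content-Encoding: gzip.
payload = b"<html>hello</html>"
gz_body = gzip.compress(payload)

# 16 + MAX_WBITS: expect a gzip header and trailer.
via_gzip_mode = zlib.decompress(gz_body, 16 + zlib.MAX_WBITS)

# 32 + MAX_WBITS: auto-detect either a zlib or a gzip header.
via_auto_mode = zlib.decompress(gz_body, 32 + zlib.MAX_WBITS)

# Passing the HTTP headers along with the body fails, because the input
# no longer starts with the gzip magic bytes \x1f\x8b.
with_headers = b"Content-Encoding: gzip\r\n\r\n" + gz_body
try:
    zlib.decompress(with_headers, 16 + zlib.MAX_WBITS)
    failed = False
except zlib.error:
    failed = True

print(via_gzip_mode == payload, via_auto_mode == payload, failed)
```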


2 answers

Answer 1
1 accepted 2016-02-09 10:57:07

Answer 2
0 2016-02-09 11:37:47
