How to decode gzip-compressed page source in Python

I am trying to get the source code of a PHP web page through a proxy, but the output contains non-printable characters. The output I get is as follows:

"Date: Tue, 09 Feb 2016 10:29:14 GMT
Server: Apache/2.4.9 (Unix) OpenSSL/1.0.1g PHP/5.5.11 mod_perl/2.0.8-dev Perl/v5.16.3
X-Powered-By: PHP/5.5.11
Set-Cookie: PHPSESSID=jmqasueos33vqoe6dbm3iscvg0; path=/
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
Content-Encoding: gzip
Vary: Accept-Encoding
Content-Length: 577
Keep-Alive: timeout=5, max=99
Connection: Keep-Alive
Content-Type: text/html

�TMo�@�G����7�)P�H�H�DS��=U�=�U�]˻��_�Ycl�T�*�>��eg��
                                                          ����Z�
                                                                �V�N�f�:6�ԫ�IkZ77�A��nG�W��ɗ���RGY��Oc`-ο�ƜO��~?�V��$�
                            �l4�+���n�].W��TLJSx�/|�n��#���>��r����;�l����H��4��f�\  �SY�y��7��"

How can I decode this using Python? I tried:

decd=zlib.decompress(data, 16+zlib.MAX_WBITS)

but it does not give me the decoded data.

The proxy I am using works fine for some other web applications, but for certain applications it shows non-printable characters like these. How can I decode them?

Because I am working with a proxy, I don't want to use get(), urlopen(), or any other request call from the Python program.

One obvious way to do this is to extract the compressed data from the response and decompress it using GzipFile().read(). Splitting the response by hand like this is fragile (it does not handle a chunked body, for example), but here it goes:

from gzip import GzipFile
from StringIO import StringIO

http = 'HTTP/1.1 200 OK\r\nServer: nginx\r\nDate: Tue, 09 Feb 2016 12:02:25 GMT\r\nContent-Type: application/json\r\nContent-Length: 115\r\nConnection: close\r\nContent-Encoding: gzip\r\nAccess-Control-Allow-Origin: *\r\nAccess-Control-Allow-Credentials: true\r\n\r\n\x1f\x8b\x08\x00\xa0\xda\xb9V\x02\xff\xab\xe6RPPJ\xaf\xca,(HMQ\xb2R()*M\xd5Q\x00\x89e\xa4&\xa6\xa4\x16\x15\x03\xc5\xaa\x81\\\xa0\x80G~q\t\x90\xa7\x94QRR\x90\x94\x99\xa7\x97_\x94\xae\x04\x94\xa9\x85(\xcfM-\xc9\xc8\x07\x99\xa0\xe4\xee\x1a\xa2\x04\x11\xcb/\xcaL\xcf\xcc\x03\x89\x19Z\x1a\xe9\x19\x9aY\xe8\x19\xea\x19*q\xd5r\x01\x00\r(\xafRu\x00\x00\x00'

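# Everything after the first empty line (CRLF CRLF) is the response body.
body = http.split('\r\n\r\n', 1)[1]
# Wrap the body in a file-like object so GzipFile can read it (Python 2).
print GzipFile(fileobj=StringIO(body)).read()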

Output

{
  "gzipped": true, 
  "headers": {
    "Host": "httpbin.org"
  }, 
  "method": "GET", 
  "origin": "192.168.1.1"
}
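
For readers on Python 3, the same split-and-decompress idea might look like the sketch below. It assumes the whole response is already in memory as bytes and that the body is not chunked. The sample response is built locally with gzip.compress rather than taken from the capture above, and decode_body is a hypothetical helper name.

```python
import gzip
import io


def decode_body(raw_response):
    """Split a raw HTTP response (bytes) and gunzip the body if needed."""
    # The header block ends at the first empty line (CRLF CRLF).
    head, _, body = raw_response.partition(b"\r\n\r\n")
    # Header names are case-insensitive, so store them in lower case.
    headers = {}
    for line in head.split(b"\r\n")[1:]:  # skip the status line
        name, _, value = line.partition(b":")
        headers[name.strip().lower()] = value.strip().lower()
    if headers.get(b"content-encoding") == b"gzip":
        # Same approach as above: hand the body to GzipFile as a file-like object.
        return gzip.GzipFile(fileobj=io.BytesIO(body)).read()
    return body


# Sample response built locally for illustration (an assumption, not real traffic).
payload = b'{"gzipped": true}'
compressed = gzip.compress(payload)
raw = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: application/json\r\n"
    b"Content-Encoding: gzip\r\n"
    b"Content-Length: " + str(len(compressed)).encode("ascii") + b"\r\n"
    b"\r\n"
) + compressed

print(decode_body(raw).decode("utf-8"))
```

Checking the Content-Encoding header first means responses that are not compressed pass through unchanged, which matters for a proxy that sees both kinds.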

If you feel compelled to parse the full HTTP response message, here is a rather roundabout way to do it, inspired by this answer: construct an httplib.HTTPResponse directly from the raw HTTP response, use that to create a urllib3.response.HTTPResponse, and then access the decompressed data:

import httplib
from cStringIO import StringIO
from urllib3.response import HTTPResponse

http = 'HTTP/1.1 200 OK\r\nServer: nginx\r\nDate: Tue, 09 Feb 2016 12:02:25 GMT\r\nContent-Type: application/json\r\nContent-Length: 115\r\nConnection: close\r\nContent-Encoding: gzip\r\nAccess-Control-Allow-Origin: *\r\nAccess-Control-Allow-Credentials: true\r\n\r\n\x1f\x8b\x08\x00\xa0\xda\xb9V\x02\xff\xab\xe6RPPJ\xaf\xca,(HMQ\xb2R()*M\xd5Q\x00\x89e\xa4&\xa6\xa4\x16\x15\x03\xc5\xaa\x81\\\xa0\x80G~q\t\x90\xa7\x94QRR\x90\x94\x99\xa7\x97_\x94\xae\x04\x94\xa9\x85(\xcfM-\xc9\xc8\x07\x99\xa0\xe4\xee\x1a\xa2\x04\x11\xcb/\xcaL\xcf\xcc\x03\x89\x19Z\x1a\xe9\x19\x9aY\xe8\x19\xea\x19*q\xd5r\x01\x00\r(\xafRu\x00\x00\x00'

class DummySocket(object):
    def __init__(self, data):
        self._data = StringIO(data)
    def makefile(self, *args, **kwargs):
        return self._data

response = httplib.HTTPResponse(DummySocket(http))
response.begin()
response = HTTPResponse.from_httplib(response)
print(response.data)

Output

{
  "gzipped": true, 
  "headers": {
    "Host": "httpbin.org"
  }, 
  "method": "GET", 
  "origin": "192.168.1.1"
}
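
On Python 3, httplib became http.client and cStringIO is replaced by io.BytesIO. A rough equivalent that uses only the standard library is sketched below. Note that http.client parses the status line and headers but does not decompress the body, so gzip.decompress is applied explicitly. FakeSocket and parse_response are hypothetical names, and the sample response is generated locally.

```python
import gzip
import io
from http.client import HTTPResponse


class FakeSocket:
    """Minimal stand-in for a socket: HTTPResponse only calls makefile()."""

    def __init__(self, data):
        self._file = io.BytesIO(data)

    def makefile(self, *args, **kwargs):
        return self._file


def parse_response(raw_response):
    """Parse raw HTTP response bytes and return (status, decoded body)."""
    response = HTTPResponse(FakeSocket(raw_response))
    response.begin()  # reads the status line and the headers
    encoding = (response.getheader("Content-Encoding") or "").lower()
    body = response.read()  # body bytes, still compressed at this point
    if encoding == "gzip":
        body = gzip.decompress(body)
    return response.status, body


# Sample response built locally for illustration (an assumption, not real traffic).
payload = b'{"gzipped": true}'
compressed = gzip.compress(payload)
raw = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Encoding: gzip\r\n"
    b"Content-Length: " + str(len(compressed)).encode("ascii") + b"\r\n"
    b"\r\n"
) + compressed

status, body = parse_response(raw)
print(status, body.decode("utf-8"))
```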

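Although gzip uses zlib, data sent with Content-Encoding: gzip carries a gzip header in front of the compressed stream. zlib.decompress does not interpret that header with its default wbits; passing 16 + zlib.MAX_WBITS does handle it, but only when data holds the response body alone, without the HTTP headers.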
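
To make the effect of the wbits argument concrete, here is a small Python 3 sketch. It compresses a made-up payload locally and tries several ways of decompressing it. The fails helper and the sample data are assumptions for illustration only.

```python
import gzip
import zlib

payload = b"<html>hello</html>"
# gzip-framed data, as a server sends it with Content-Encoding: gzip
body = gzip.compress(payload)
headers = b"HTTP/1.1 200 OK\r\nContent-Encoding: gzip\r\n\r\n"


def fails(data, wbits):
    """Return True if zlib rejects the data with the given wbits."""
    try:
        zlib.decompress(data, wbits)
        return False
    except zlib.error:
        return True


# wbits = 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
gzip_mode = zlib.decompress(body, 16 + zlib.MAX_WBITS)

# wbits = 32 + MAX_WBITS auto-detects a zlib or a gzip header.
auto_mode = zlib.decompress(body, 32 + zlib.MAX_WBITS)

# The default wbits expects a zlib header, so gzip data is rejected.
default_rejected = fails(body, zlib.MAX_WBITS)

# HTTP header bytes in front of the gzip data also make decompression fail.
with_headers_rejected = fails(headers + body, 16 + zlib.MAX_WBITS)

print(gzip_mode, auto_mode, default_rejected, with_headers_rejected)
```

If 16 + zlib.MAX_WBITS fails on captured data, one likely cause is that the bytes passed in still include the HTTP headers.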

Put your data in a file-like object and pass it through the gzip module. For example something like:

import cStringIO
import gzip
# data is the gzip-compressed response body (a Python 2 str)
mydatafile = cStringIO.StringIO(data)
gzipper = gzip.GzipFile(fileobj=mydatafile)
decdata = gzipper.read()

This is taken from my (by now old) HTTP library for Python 2.x.
