I am trying to get the source code of a PHP web page through a proxy, but the output contains non-printable characters. The output I got is as follows:
"Date: Tue, 09 Feb 2016 10:29:14 GMT Server: Apache/2.4.9 (Unix) OpenSSL/1.0.1g PHP/5.5.11 mod_perl/2.0.8-dev Perl/v5.16.3 X-Powered-By: PHP/5.5.11 Set-Cookie: PHPSESSID=jmqasueos33vqoe6dbm3iscvg0; path=/ Expires: Thu, 19 Nov 1981 08:52:00 GMT Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0 Pragma: no-cache Content-Encoding: gzip Vary: Accept-Encoding Content-Length: 577 Keep-Alive: timeout=5, max=99 Connection: Keep-Alive Content-Type: text/html �TMo�@�G����7�)P�H�H�DS��=U�=�U�]˻��_�Ycl�T�*�>��eg�� ����Z� �V�N�f�:6�ԫ�IkZ77�A��nG�W��ɗ���RGY��Oc`-ο�ƜO��~?�V��$� �l4�+���n�].W��TLJSx�/|�n��#���>��r����;�l����H��4��f�\ �SY�y��7��"
How can I decode this using Python? I tried:
decd=zlib.decompress(data, 16+zlib.MAX_WBITS)
but it does not give the decoded data.
The proxy I am using works fine for several other web applications, but it shows non-printable characters for some of them. How can I decode this?
Because I am using a proxy, I don't want to use get(), urlopen(), or any other request made from the Python program.
One obvious way to do this is to extract the compressed data from the response and decompress it using GzipFile().read(). Splitting the response this way might be prone to failure, but here it goes:
from gzip import GzipFile
from StringIO import StringIO
http = 'HTTP/1.1 200 OK\r\nServer: nginx\r\nDate: Tue, 09 Feb 2016 12:02:25 GMT\r\nContent-Type: application/json\r\nContent-Length: 115\r\nConnection: close\r\nContent-Encoding: gzip\r\nAccess-Control-Allow-Origin: *\r\nAccess-Control-Allow-Credentials: true\r\n\r\n\x1f\x8b\x08\x00\xa0\xda\xb9V\x02\xff\xab\xe6RPPJ\xaf\xca,(HMQ\xb2R()*M\xd5Q\x00\x89e\xa4&\xa6\xa4\x16\x15\x03\xc5\xaa\x81\\\xa0\x80G~q\t\x90\xa7\x94QRR\x90\x94\x99\xa7\x97_\x94\xae\x04\x94\xa9\x85(\xcfM-\xc9\xc8\x07\x99\xa0\xe4\xee\x1a\xa2\x04\x11\xcb/\xcaL\xcf\xcc\x03\x89\x19Z\x1a\xe9\x19\x9aY\xe8\x19\xea\x19*q\xd5r\x01\x00\r(\xafRu\x00\x00\x00'
body = http.split('\r\n\r\n', 1)[1]
print GzipFile(fileobj=StringIO(body)).read()
Output
{ "gzipped": true, "headers": { "Host": "httpbin.org" }, "method": "GET", "origin": "192.168.1.1" }
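On Python 3 the same idea needs bytes rather than str, and gzip.decompress can stand in for the GzipFile/StringIO pair. The following is only a minimal sketch and not part of the original answer: the helper name split_and_gunzip and the sample payload are made up for illustration, and it assumes the body is not sent with chunked transfer encoding.

```python
import gzip


def split_and_gunzip(raw_response):
    """Return the decompressed body of a raw, gzip-encoded HTTP response (bytes)."""
    # Headers and body are separated by the first blank line (CRLF CRLF).
    head, sep, body = raw_response.partition(b'\r\n\r\n')
    if not sep:
        raise ValueError('no header/body separator found')
    return gzip.decompress(body)


# Build a small sample response so the sketch is self-contained (hypothetical data).
payload = gzip.compress(b'{"gzipped": true}')
raw = (b'HTTP/1.1 200 OK\r\n'
       b'Content-Encoding: gzip\r\n'
       b'Content-Length: ' + str(len(payload)).encode('ascii') + b'\r\n'
       b'\r\n') + payload

print(split_and_gunzip(raw).decode('utf-8'))
```

Because partition splits at the first blank line, any CRLF CRLF byte sequence that happens to occur inside the compressed body is left untouched.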
If you feel compelled to parse the full HTTP response message, then, as inspired by this answer, here is a rather roundabout way to do it: construct an httplib.HTTPResponse directly from the raw HTTP response, use that to create a urllib3.response.HTTPResponse, and then access the decompressed data:
import httplib
from cStringIO import StringIO
from urllib3.response import HTTPResponse
http = 'HTTP/1.1 200 OK\r\nServer: nginx\r\nDate: Tue, 09 Feb 2016 12:02:25 GMT\r\nContent-Type: application/json\r\nContent-Length: 115\r\nConnection: close\r\nContent-Encoding: gzip\r\nAccess-Control-Allow-Origin: *\r\nAccess-Control-Allow-Credentials: true\r\n\r\n\x1f\x8b\x08\x00\xa0\xda\xb9V\x02\xff\xab\xe6RPPJ\xaf\xca,(HMQ\xb2R()*M\xd5Q\x00\x89e\xa4&\xa6\xa4\x16\x15\x03\xc5\xaa\x81\\\xa0\x80G~q\t\x90\xa7\x94QRR\x90\x94\x99\xa7\x97_\x94\xae\x04\x94\xa9\x85(\xcfM-\xc9\xc8\x07\x99\xa0\xe4\xee\x1a\xa2\x04\x11\xcb/\xcaL\xcf\xcc\x03\x89\x19Z\x1a\xe9\x19\x9aY\xe8\x19\xea\x19*q\xd5r\x01\x00\r(\xafRu\x00\x00\x00'
class DummySocket(object):
    def __init__(self, data):
        self._data = StringIO(data)

    def makefile(self, *args, **kwargs):
        return self._data
response = httplib.HTTPResponse(DummySocket(http))
response.begin()
response = HTTPResponse.from_httplib(response)
print(response.data)
Output
{ "gzipped": true, "headers": { "Host": "httpbin.org" }, "method": "GET", "origin": "192.168.1.1" }
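For readers on Python 3, httplib has become http.client, and the urllib3 step can be skipped by checking Content-Encoding by hand. The following is a rough sketch under those assumptions, not the answer's own code: FakeSocket, parse_raw_response and the sample response are hypothetical names and data introduced only for illustration.

```python
import gzip
import http.client
import io


class FakeSocket:
    """Minimal stand-in for a socket; HTTPResponse only calls makefile() on it."""

    def __init__(self, data):
        self._file = io.BytesIO(data)

    def makefile(self, *args, **kwargs):
        return self._file


def parse_raw_response(raw_response):
    """Parse raw HTTP response bytes; return (status, content type, decoded body)."""
    response = http.client.HTTPResponse(FakeSocket(raw_response))
    response.begin()        # parses the status line and the headers
    body = response.read()  # honours Content-Length
    if response.getheader('Content-Encoding', '').lower() == 'gzip':
        body = gzip.decompress(body)
    return response.status, response.getheader('Content-Type'), body


# Hypothetical sample response, built here so the sketch runs on its own.
payload = gzip.compress(b'{"gzipped": true}')
raw = (b'HTTP/1.1 200 OK\r\n'
       b'Content-Type: application/json\r\n'
       b'Content-Encoding: gzip\r\n'
       b'Content-Length: ' + str(len(payload)).encode('ascii') + b'\r\n'
       b'Connection: close\r\n'
       b'\r\n') + payload

status, content_type, body = parse_raw_response(raw)
print(status, content_type, body.decode('utf-8'))
```

Letting the standard library parse the headers avoids hand-written splitting, at the cost of the dummy socket class.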
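Although gzip uses zlib, when Content-Encoding is set to gzip the compressed stream is preceded by an additional gzip header, which zlib.decompress does not interpret with its default settings. Passing 16 + zlib.MAX_WBITS as the wbits argument does enable gzip header handling, but only if data holds just the response body, without the HTTP headers.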
Put your data in a file-like object and pass it through the gzip module. For example, something like:
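import cStringIO
import gzip

# data must contain only the gzip-compressed body, not the HTTP headers
mydatafile = cStringIO.StringIO(data)
gzipper = gzip.GzipFile(fileobj=mydatafile)
decdata = gzipper.read()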
From my already old http library for Python 2.x
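As a side note, the zlib.decompress(data, 16+zlib.MAX_WBITS) call from the question can also unwrap gzip data, as long as data starts at the gzip magic bytes rather than at the HTTP status line. The Python 3 sketch below illustrates this with a made-up response; the variable names and the sample payload are assumptions for the example only.

```python
import gzip
import zlib

# Hypothetical raw response: HTTP headers followed by a gzip-compressed body.
compressed = gzip.compress(b'hello from the proxy')
raw_response = b'HTTP/1.1 200 OK\r\nContent-Encoding: gzip\r\n\r\n' + compressed

# Feeding the whole response fails, because it begins with "HTTP/1.1"
# instead of a gzip header.
try:
    zlib.decompress(raw_response, 16 + zlib.MAX_WBITS)
    whole_response_ok = True
except zlib.error:
    whole_response_ok = False

# Stripping the headers first lets the same call succeed.
body = raw_response.split(b'\r\n\r\n', 1)[1]
decoded = zlib.decompress(body, 16 + zlib.MAX_WBITS)

print(whole_response_ok)
print(decoded.decode('utf-8'))
```

Adding 16 to zlib.MAX_WBITS tells zlib to expect a gzip header and trailer instead of the zlib ones.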