
I am trying to get HTML data from a site using urllib, but for some sites I end up with unknown characters in Python

Hey guys, I am trying to get the HTML of a page using urllib.urlopen(url).read(), but for some sites all I get is data like this: * 6\\xbdW\\xb6\\xd6\\xff\\xca\\x9d\\x9bO|\\xc0\\x96a\\xc7\\xc8\\xf7\\xa7\\x10-\\x8aM{\\xf8\\x* I have no clue what it is or why I am getting it. I tried Googling it; some people said it is an encoding/decoding problem, and I tried that as well, but as you can see, no luck. Please guide me. Here is my code:

import urllib  # Python 2 module; urlopen lives in urllib.request on Python 3

url = "http://mangafox.me/manga/online_the_comic/c001/1.html"  # fails for this site and some others
page = urllib.urlopen(url).read()
print page

And you already know what gets printed when this runs.

This page is served in gzip format; you have to decompress it before you can read the data:

UnicodeDecodeError: 'ascii' codec can't decode byte 0x8b in position 1: ordinal not in range(128)

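The byte 0x8b at position 1, right after a leading 0x1f, is the second byte of the gzip magic number, which means the response body is gzip-compressed.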
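As a minimal sketch of that check, assuming Python 3 (the question's code is Python 2), the snippet below looks at the first two bytes and decompresses only when they match the gzip magic number. The helper name `maybe_gunzip` and the constant `GZIP_MAGIC` are made up for this illustration; they are not part of urllib or gzip.

```python
import gzip
import io

# Every gzip stream starts with these two bytes.
GZIP_MAGIC = b"\x1f\x8b"


def maybe_gunzip(raw):
    """Return the decompressed bytes if raw looks like gzip, else raw unchanged.

    Hypothetical helper for illustration only.
    """
    if raw[:2] == GZIP_MAGIC:
        # Wrap the bytes in a file-like object so GzipFile can read them.
        with gzip.GzipFile(fileobj=io.BytesIO(raw)) as f:
            return f.read()
    return raw


# Simulate a compressed response body instead of hitting the network.
compressed = gzip.compress(b"<html>hello</html>")
print(maybe_gunzip(compressed))
print(maybe_gunzip(b"<html>plain</html>"))
```

With a real request this would presumably be used as `page = maybe_gunzip(urllib.request.urlopen(url).read())`. The result is still bytes, so it has to be decoded with the page's character encoding before it can be treated as text.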

You should take a look at this question:

twitter trends api UnicodeDecodeError: 'utf8' codec can't decode byte 0x8b in position 1: unexpected code byte


 