简体   繁体   中英

Python - decode ('utf-8') issue

I am very new to Python.Please help me fix this issue.

I am trying to get the revenue from the link below:

https://www.google.co.in/?gfe_rd=cr&ei=kFFsWYyPEqvM8AeF7Y2IDQ&gws_rd=ssl#q=adp+revenue&stick=H4sIAAAAAAAAAOPgE-LUz9U3MMkozijTUskot9JPzs_JSU0uyczP088vSk_My6xKBHGKrYpSy1LzSlMBIRiSrDMAAAA&spf=1500270991898

I am using below commands:

import re

import urllib.request

data=urllib.request.urlopen(url).read()

data1=data.decode("utf-8")

Issue:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 10798: invalid start byte

Maybe better with requests:

import requests

url = "https://www.google.co.in/?gfe_r...."
req = requests.get(url)
req.encoding = "utf-8"
data = req.text

The result of downloading the specific URL given in the question, is HTML code. I was able to use BeautifulSoup to scrape the page after using the following Python code to get the data:

import requests

url = "https://www.google.co.in/?gfe_rd=cr&ei=kFFsWYyPEqvM8AeF7Y2IDQ&gws_rd=ssl#q=adp+revenue&stick=H4sIAAAAAAAAAOPgE-LUz9U3MMkozijTUskot9JPzs_JSU0uyczP088vSk_My6xKBHGKrYpSy1LzSlMBIRiSrDMAAAA&spf=1500270991898"

response = requests.get(url)
data = response.content.decode('utf-8', errors="replace")

print (data)

Please note that I used Python3 in my code example. The syntax for print() may vary a little.

0xa0 or in unicode notation U+00A0 is the character NO-BREAK SPACE. In UTF8 it is represented as b'\xc2\xa0' . If you find it as a raw byte it probably means that your input is not UTF8 encoded but Latin1 encoded.

A quick look on the linked page shows that it is indeed latin1 encoded - but I got a french version...

The rule when you are not sure of the exact convertion is to use the replace errors processing:

data1=data.decode("utf-8", errors="replace")

then, all offending characters are replaced with the REPLACEMENT CHARACTER (U+FFFD) (displayed as �). If only few are found, that means the page contains erroneous characters, but if almost all non-ascii characters are replaced, then it means that the encoding is not UTF8. If is commonly Latin1 for west european languages, but your mileage may vary for other languages.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM