Python - decode ('utf-8') issue

Question

I am very new to Python. Please help me fix this issue.

I am trying to get the revenue from the link below:

https://www.google.co.in/?gfe_rd=cr&ei=kFFsWYyPEqvM8AeF7Y2IDQ&gws_rd=ssl#q=adp+revenue&stick=H4sIAAAAAAAAAOPgE-LUz9U3MMkozijTUskot9JPzs_JSU0uyczP088vSk_My6xKBHGKrYpSy1LzSlMBIRiSrDMAAAA&spf=1500270991898

I am using the commands below:

import re
import urllib.request

# url holds the link shown above
data = urllib.request.urlopen(url).read()
data1 = data.decode("utf-8")

Issue:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 10798: invalid start byte

Answer 1

It may be easier with the requests library:

import requests

url = "https://www.google.co.in/?gfe_r...."
req = requests.get(url)
req.encoding = "utf-8"
data = req.text

Answer 2

Downloading the specific URL given in the question returns HTML. I was able to use BeautifulSoup to scrape the page after using the following Python code to get the data:

import requests

url = "https://www.google.co.in/?gfe_rd=cr&ei=kFFsWYyPEqvM8AeF7Y2IDQ&gws_rd=ssl#q=adp+revenue&stick=H4sIAAAAAAAAAOPgE-LUz9U3MMkozijTUskot9JPzs_JSU0uyczP088vSk_My6xKBHGKrYpSy1LzSlMBIRiSrDMAAAA&spf=1500270991898"

response = requests.get(url)
data = response.content.decode('utf-8', errors="replace")

print(data)

Please note that I used Python 3 in my code example. In Python 2, print is a statement rather than a function, so the syntax differs slightly.

Answer 3

0xa0, or U+00A0 in Unicode notation, is the character NO-BREAK SPACE. In UTF-8 it is encoded as b'\xc2\xa0'. If you find it as a lone raw byte, it probably means that your input is not UTF-8 encoded but Latin-1 encoded.
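To make the byte-level point concrete, here is a minimal sketch. The sample bytes in `raw` are made up for illustration and are not taken from the linked page; they simply contain a lone 0xa0 byte like the one in the error message.

```python
# Hypothetical sample: ASCII text with a lone 0xa0 byte in the middle.
raw = b"revenue:\xa012.4"

# A lone 0xa0 is a continuation byte in UTF-8, so it cannot start a character.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError as exc:
    utf8_ok = False
    print(exc)

# Latin-1 maps every byte to the code point with the same value,
# so 0xa0 becomes U+00A0 (NO-BREAK SPACE).
latin1_text = raw.decode("latin-1")

# The same character encoded as UTF-8 takes two bytes.
utf8_bytes = "\u00a0".encode("utf-8")
print(utf8_bytes)
```

Note that decoding as Latin-1 never raises, because every byte value is valid in that encoding; success alone does not prove the data really is Latin-1.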

A quick look at the linked page shows that it is indeed Latin-1 encoded, though I got a French version of it.

The rule when you are not sure of the exact encoding is to use the replace error handler:

data1=data.decode("utf-8", errors="replace")

Then all offending bytes are replaced with the REPLACEMENT CHARACTER (U+FFFD), displayed as �. If only a few are found, the page contains some erroneous characters, but if almost all non-ASCII characters are replaced, the encoding is not UTF-8. It is commonly Latin-1 for Western European languages, but your mileage may vary for other languages.
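The heuristic described above can be sketched roughly as follows. The function name `decode_with_fallback` and the `threshold` value are my own assumptions for illustration, not part of any library, and the Latin-1 fallback is only a reasonable guess for Western European pages.

```python
def decode_with_fallback(raw, threshold=0.5):
    """Decode bytes as UTF-8; fall back to Latin-1 if most non-ASCII fails.

    Hypothetical helper: `threshold` is an arbitrary cut-off for the share
    of non-ASCII characters that had to be replaced.
    """
    text = raw.decode("utf-8", errors="replace")
    replaced = text.count("\ufffd")                     # replacement characters
    non_ascii = sum(1 for ch in text if ord(ch) > 127)  # includes U+FFFD itself
    if non_ascii and replaced / non_ascii > threshold:
        # Almost nothing non-ASCII decoded cleanly: probably not UTF-8.
        return raw.decode("latin-1"), "latin-1"
    return text, "utf-8"


# Usage with made-up sample data (not the page from the question):
latin1_sample = "café déjà".encode("latin-1")
utf8_sample = "café déjà".encode("utf-8")
print(decode_with_fallback(latin1_sample))
print(decode_with_fallback(utf8_sample))
```

With a real download, `raw` would be the bytes returned by `urllib.request.urlopen(url).read()` or `response.content`. When the server or the HTML declares a charset, that declaration is a better source than this kind of guess.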
