
Python - decode('utf-8') issue

I am very new to Python. Please help me fix this issue.

I am trying to get the revenue from the link below:

https://www.google.co.in/?gfe_rd=cr&ei=kFFsWYyPEqvM8AeF7Y2IDQ&gws_rd=ssl#q=adp+revenue&stick=H4sIAAAAAAAAAOPgE-LUz9U3MMkozijTUskot9JPzs_JSU0uyczP088vSk_My6xKBHGKrYpSy1LzSlMBIRiSrDMAAAA&spf=1500270991898

I am using the commands below:

import re
import urllib.request

data = urllib.request.urlopen(url).read()
data1 = data.decode("utf-8")

Issue:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 10798: invalid start byte

It may be easier with requests:

import requests

url = "https://www.google.co.in/?gfe_r...."
req = requests.get(url)
req.encoding = "utf-8"
data = req.text

Downloading the specific URL given in the question returns HTML. I was able to scrape the page with BeautifulSoup after using the following Python code to get the data:

import requests

url = "https://www.google.co.in/?gfe_rd=cr&ei=kFFsWYyPEqvM8AeF7Y2IDQ&gws_rd=ssl#q=adp+revenue&stick=H4sIAAAAAAAAAOPgE-LUz9U3MMkozijTUskot9JPzs_JSU0uyczP088vSk_My6xKBHGKrYpSy1LzSlMBIRiSrDMAAAA&spf=1500270991898"

response = requests.get(url)
data = response.content.decode('utf-8', errors="replace")

print(data)

Please note that I used Python 3 in my code example. The syntax for print() differs slightly in Python 2.

0xa0, or U+00A0 in Unicode notation, is the NO-BREAK SPACE character. In UTF-8 it is represented as b'\xc2\xa0'. If you find it as a lone raw byte, it probably means that your input is not UTF-8 encoded but Latin1 encoded.
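To make the byte-level difference concrete, here is a small sketch that needs no network access. The sample bytes `b"price:\xa0100"` are made up for illustration and are not taken from the page in the question.

```python
# NO-BREAK SPACE (U+00A0) encoded in the two encodings discussed above.
nbsp = "\u00a0"

utf8_bytes = nbsp.encode("utf-8")      # two bytes
latin1_bytes = nbsp.encode("latin-1")  # one byte

# Hypothetical Latin-1 input containing a lone 0xa0 byte.
sample = b"price:\xa0100"

# In UTF-8, 0xa0 is only valid as a continuation byte, so it cannot
# start a character and strict decoding fails.
try:
    sample.decode("utf-8")
    decode_failed = False
    reason = None
except UnicodeDecodeError as exc:
    decode_failed = True
    reason = exc.reason

# The same bytes decode cleanly as Latin-1.
as_latin1 = sample.decode("latin-1")

print(utf8_bytes, latin1_bytes, decode_failed, reason, repr(as_latin1))
```

The failure reason reported here is the same one shown in the traceback from the question.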

A quick look at the linked page shows that it is indeed Latin1 encoded, but I got a French version...
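Rather than inspecting the page by hand, one option is to read the charset the server declares in its `Content-Type` header. The sketch below assumes the server sets that header correctly, which is not guaranteed. `decode_body` is a hypothetical helper name, and the demo builds the headers offline instead of contacting the URL from the question.

```python
from email.message import Message


def decode_body(raw: bytes, headers: Message, fallback: str = "latin-1") -> str:
    """Decode raw bytes with the charset declared in Content-Type.

    Falls back to `fallback` (an assumption, Latin-1 here) when the
    header carries no charset parameter.
    """
    charset = headers.get_content_charset() or fallback
    return raw.decode(charset, errors="replace")


# Offline demo with hand-built headers and made-up bytes.
headers = Message()
headers["Content-Type"] = "text/html; charset=ISO-8859-1"
text = decode_body(b"revenue:\xa012.4", headers)
print(repr(text))
```

With `urllib.request`, the response object exposes its headers as `response.headers`, which is a subclass of `email.message.Message`, so `decode_body(response.read(), response.headers)` should work the same way on a live response.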

The rule when you are not sure of the exact encoding is to use the replace error handler:

data1=data.decode("utf-8", errors="replace")

Then all offending bytes are replaced with the REPLACEMENT CHARACTER (U+FFFD), displayed as �. If only a few are found, the page contains a few erroneous characters, but if almost all non-ASCII characters are replaced, the encoding is not UTF-8. It is commonly Latin1 for Western European languages, but your mileage may vary for other languages.
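The heuristic described above can be sketched roughly as follows. The function name `decode_with_fallback` and the 0.5 threshold are arbitrary choices for illustration, and Latin-1 is only a guess that suits Western European text.

```python
from typing import Tuple


def decode_with_fallback(raw: bytes, max_bad_ratio: float = 0.5) -> Tuple[str, str]:
    """Decode as UTF-8 with replacement, then fall back to Latin-1 if
    most non-ASCII characters turned into U+FFFD.

    Returns the decoded text and the name of the encoding used.
    """
    text = raw.decode("utf-8", errors="replace")
    bad = text.count("\ufffd")
    # U+FFFD is itself non-ASCII, so non_ascii >= bad and the division is safe.
    non_ascii = sum(1 for ch in text if ord(ch) > 127)
    if bad and bad / non_ascii > max_bad_ratio:
        return raw.decode("latin-1"), "latin-1"
    return text, "utf-8"


latin1_page = "café à la crème".encode("latin-1")
utf8_page = "café à la crème".encode("utf-8")

print(decode_with_fallback(latin1_page))
print(decode_with_fallback(utf8_page))
```

Latin-1 maps every possible byte to a character, so the fallback never raises. The flip side is that it cannot tell you when the guess is wrong, so it is worth checking the result by eye.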
