Why can't I decode this UTF-8 page?
Howdy folks,

I'm new to getting data from the web using Python. I'd like to have the source code of this page in a string: https://projects.fivethirtyeight.com/2018-nba-predictions/

The following code has worked for other pages (such as https://www.basketball-reference.com/boxscores/201712090ATL.html ):
import urllib.request

webAddress = 'https://projects.fivethirtyeight.com/2018-nba-predictions/'
file = urllib.request.urlopen(webAddress)
data = file.read()
file.close()
dataString = data.decode(encoding='UTF-8')
And I'd expect dataString to be a string of HTML (see below for my expectations in this specific case):
<!DOCTYPE html><html lang="en"><head><meta property="article:modified_time" etc etc
Instead, for the 538 website, I get this error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
My research has suggested that the problem is that my file isn't actually encoded using UTF-8, but both the page's charset and Beautiful Soup's UnicodeDammit() claim it's UTF-8 (the second might be because of the first). chardet.detect() doesn't suggest any encoding.

I've tried substituting the following for 'UTF-8' in the encoding parameter of decode(), to no avail:
ISO-8859-1
latin-1
Windows-1252
Perhaps worth mentioning is that the byte array data doesn't look like I'd expect it to. Here's data[:10] from a working URL:

b'\n<!DOCTYPE'

Here's data[:10] from the 538 site:

b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03'

What's up?
The server provided you with gzip-compressed data; this is not especially common, as urllib by default doesn't send any Accept-Encoding header, so servers generally play it safe and don't compress the data.
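You can in fact recognize this from the bytes printed in the question: every gzip stream begins with the magic bytes 0x1f 0x8b (per RFC 1952), which is exactly the b'\x1f\x8b...' prefix that data[:10] showed. A minimal sketch of that check, using locally compressed bytes:

```python
import gzip

# Every gzip stream begins with the magic bytes 0x1f 0x8b (RFC 1952),
# matching the b'\x1f\x8b...' prefix printed in the question
payload = gzip.compress(b'<!DOCTYPE html>')
print(payload[:2] == b'\x1f\x8b')  # True
```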
Still, the content-encoding field of the response is set, so you have a way to know that your page is indeed gzip-compressed, and you can decompress it using Python's gzip module before further processing.
import urllib.request
import gzip

file = urllib.request.urlopen(webAddress)
data = file.read()
file.close()
# The server isn't obliged to compress, so check content-encoding first;
# headers.get avoids an error when the header is absent
if file.headers.get('content-encoding', '').lower() == 'gzip':
    data = gzip.decompress(data)
dataString = data.decode(encoding='UTF-8')
OTOH, if you have the possibility to use the requests module, it will handle all this mess by itself, including compression (did I mention that you may also get deflate besides gzip, which is the same but with different headers?) and (at least partially) encoding.
import requests
webAddress = "https://projects.fivethirtyeight.com/2018-nba-predictions/"
r = requests.get(webAddress)
print(repr(r.text))
This will perform your request and correctly print out the already-decoded Unicode string.
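If you ever do have to undo deflate by hand, Python's zlib module covers both variants mentioned above; a small sketch, using locally compressed bytes to stand in for a response body:

```python
import zlib

raw = b'<!DOCTYPE html><html lang="en"></html>'

# HTTP "deflate" is usually a zlib-wrapped stream (2-byte header plus checksum)
zlib_stream = zlib.compress(raw)
decoded = zlib.decompress(zlib_stream)

# ...but some servers send a raw deflate stream with no zlib wrapper;
# a negative wbits value tells zlib to read/write that headerless format
co = zlib.compressobj(wbits=-zlib.MAX_WBITS)
raw_stream = co.compress(raw) + co.flush()
decoded_raw = zlib.decompress(raw_stream, -zlib.MAX_WBITS)
```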
You are reading gzip-compressed data: http://www.forensicswiki.org/wiki/Gzip . You have to decompress it.
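A minimal sketch of that decompression step, with locally compressed bytes standing in for the bytes read from the response:

```python
import gzip

# stand-in for the compressed bytes read from the response
data = gzip.compress('<!DOCTYPE html><html lang="en"></html>'.encode('utf-8'))

html = gzip.decompress(data).decode('utf-8')
print(html[:15])  # <!DOCTYPE html>
```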