Why can't I decode this UTF-8 page?
Howdy folks,

I'm new to getting data from the web using Python. I'd like to have the source code of this page in a string: https://projects.fivethirtyeight.com/2018-nba-predictions/

The following code has worked for other pages (such as https://www.basketball-reference.com/boxscores/201712090ATL.html ):
import urllib.request

webAddress = 'https://projects.fivethirtyeight.com/2018-nba-predictions/'
file = urllib.request.urlopen(webAddress)
data = file.read()
file.close()
dataString = data.decode(encoding='UTF-8')
And I'd expect dataString to be a string of HTML (see below for my expectations in this specific case):
<!DOCTYPE html><html lang="en"><head><meta property="article:modified_time" etc etc
Instead, for the 538 website, I get this error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
My research has suggested that the problem is that my file isn't actually encoded using UTF-8, but both the page's charset and Beautiful Soup's UnicodeDammit() claim it's UTF-8 (the second might be because of the first). chardet.detect() doesn't suggest any encoding.

I've tried substituting the following for 'UTF-8' in the encoding parameter of decode(), to no avail:
ISO-8859-1
latin-1
Windows-1252
Perhaps worth mentioning is that the byte array data doesn't look like I'd expect it to. Here's data[:10] from a working URL:

b'\n<!DOCTYPE'

Here's data[:10] from the 538 site:

b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03'

What's up?
The server provided you with gzip-compressed data; this is not especially common, as urllib by default doesn't send any Accept-Encoding header, so servers generally play it safe and don't compress the data.
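You can in fact recognize this from the bytes printed in the question: every gzip stream begins with the magic bytes 0x1f 0x8b (per RFC 1952), which is exactly the b'\x1f\x8b...' prefix that data[:10] showed. A minimal sketch of that check, using locally compressed bytes:

```python
import gzip

# Every gzip stream begins with the magic bytes 0x1f 0x8b (RFC 1952),
# matching the b'\x1f\x8b...' prefix printed in the question
payload = gzip.compress(b'<!DOCTYPE html>')
print(payload[:2] == b'\x1f\x8b')  # True
```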
Still, the content-encoding field of the response is set, so you have a way to know that your page is indeed gzip-compressed, and you can decompress it using Python's gzip module before further processing.
import urllib.request
import gzip

file = urllib.request.urlopen(webAddress)
data = file.read()
file.close()
# The server isn't obliged to compress, so check content-encoding first;
# headers.get avoids an error when the header is absent
if file.headers.get('content-encoding', '').lower() == 'gzip':
    data = gzip.decompress(data)
dataString = data.decode(encoding='UTF-8')
OTOH, if you have the possibility to use the requests module, it will handle all this mess by itself, including compression (did I mention that you may also get deflate besides gzip, which is the same but with different headers?) and (at least partially) encoding.
import requests
webAddress = "https://projects.fivethirtyeight.com/2018-nba-predictions/"
r = requests.get(webAddress)
print(repr(r.text))
This will perform your request and correctly print out the already-decoded Unicode string.
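If you ever do have to undo deflate by hand, Python's zlib module covers both variants mentioned above; a small sketch, using locally compressed bytes to stand in for a response body:

```python
import zlib

raw = b'<!DOCTYPE html><html lang="en"></html>'

# HTTP "deflate" is usually a zlib-wrapped stream (2-byte header plus checksum)
zlib_stream = zlib.compress(raw)
decoded = zlib.decompress(zlib_stream)

# ...but some servers send a raw deflate stream with no zlib wrapper;
# a negative wbits value tells zlib to read/write that headerless format
co = zlib.compressobj(wbits=-zlib.MAX_WBITS)
raw_stream = co.compress(raw) + co.flush()
decoded_raw = zlib.decompress(raw_stream, -zlib.MAX_WBITS)
```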
You are reading gzip-compressed data: http://www.forensicswiki.org/wiki/Gzip . You have to decompress it.
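A minimal sketch of that decompression step, with locally compressed bytes standing in for the bytes read from the response:

```python
import gzip

# stand-in for the compressed bytes read from the response
data = gzip.compress('<!DOCTYPE html><html lang="en"></html>'.encode('utf-8'))

html = gzip.decompress(data).decode('utf-8')
print(html[:15])  # <!DOCTYPE html>
```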