
Why can't I decode this UTF-8 page?

Howdy folks,

I'm new to getting data from the web using Python. I'd like to have the source code of this page in a string: https://projects.fivethirtyeight.com/2018-nba-predictions/

The following code has worked for other pages (such as https://www.basketball-reference.com/boxscores/201712090ATL.html ):

import urllib.request
file = urllib.request.urlopen(webAddress)
data = file.read()
file.close()
dataString = data.decode(encoding='UTF-8')

I'd expect dataString to be a string of HTML. In this specific case, something like:

<!DOCTYPE html><html lang="en"><head><meta property="article:modified_time" etc etc

Instead, for the 538 website, I get this error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

My research suggests that the problem is that my file isn't actually encoded as UTF-8, but both the page's charset and Beautiful Soup's UnicodeDammit() claim it's UTF-8 (the second might be because of the first). chardet.detect() doesn't suggest any encoding. I've tried substituting the following for 'UTF-8' in the encoding parameter of decode(), to no avail:

ISO-8859-1

latin-1

Windows-1252

Perhaps worth mentioning: the bytes object data doesn't look like I'd expect it to. Here's data[:10] from a working URL:

b'\n<!DOCTYPE'

Here's data[:10] from the 538 site:

b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03'

What's up?

The server provided you with gzip-compressed data. This is somewhat unusual, because urllib by default does not advertise support for compressed responses in Accept-Encoding, so servers generally play it safe and send the data uncompressed.

Still, the Content-Encoding field of the response is set, so you can tell that the page is indeed gzip-compressed and decompress it with Python's gzip module before further processing.

import urllib.request
import gzip
file = urllib.request.urlopen(webAddress)
data = file.read()
# headers.get() returns None when the server sent no Content-Encoding header
if (file.headers.get('Content-Encoding') or '').lower() == 'gzip':
    data = gzip.decompress(data)
file.close()
dataString = data.decode(encoding='UTF-8')
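If the Content-Encoding header cannot be trusted or is missing, a fallback is to look at the first two bytes of the body, which are always 0x1f 0x8b for a gzip stream (exactly what data[:10] showed in the question). The following is only a sketch under that assumption; body_to_text is a hypothetical helper name, and the check uses locally built bytes rather than a real HTTP response.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def body_to_text(data, encoding="utf-8"):
    """Decompress data if it starts with the gzip magic number, then decode it."""
    if data[:2] == GZIP_MAGIC:
        data = gzip.decompress(data)
    return data.decode(encoding)


# Offline check: compress some HTML ourselves instead of fetching a page.
html = '<!DOCTYPE html><html lang="en">'
compressed = gzip.compress(html.encode("utf-8"))

print(compressed[:2])                              # b'\x1f\x8b'
print(body_to_text(compressed) == html)            # compressed input is handled
print(body_to_text(html.encode("utf-8")) == html)  # plain input passes through
```

Checking the header first is still the cleaner approach; the magic-number test is just a safety net for misconfigured servers.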

OTOH, if you can use the requests module, it will handle all this mess by itself, including compression (did I mention that you may also get deflate besides gzip, which is the same but with different headers?) and (at least partially) encoding.

import requests
webAddress = "https://projects.fivethirtyeight.com/2018-nba-predictions/"
r = requests.get(webAddress)
print(repr(r.text))

This will perform your request and correctly print out the already-decoded Unicode string.

You are reading gzip-compressed data (http://www.forensicswiki.org/wiki/Gzip); you have to decompress it.
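If requests is not an option, the deflate case mentioned above can be handled with the standard zlib module. This is a rough sketch, not a complete implementation: decompress_body is a hypothetical helper, and the fallback to raw deflate assumes that some servers send deflate data without the zlib wrapper.

```python
import gzip
import zlib


def decompress_body(data, content_encoding):
    """Undo the Content-Encoding named in the response header (hypothetical helper)."""
    encoding = (content_encoding or "").strip().lower()
    if encoding in ("gzip", "x-gzip"):
        return gzip.decompress(data)
    if encoding == "deflate":
        try:
            # Standard case: deflate data inside a zlib wrapper.
            return zlib.decompress(data)
        except zlib.error:
            # Assumed fallback: raw deflate stream with no zlib header.
            return zlib.decompress(data, -zlib.MAX_WBITS)
    # No Content-Encoding, or one this sketch does not handle.
    return data


# Offline check with locally compressed bytes.
page = b"<!DOCTYPE html>"
print(decompress_body(gzip.compress(page), "gzip") == page)
print(decompress_body(zlib.compress(page), "deflate") == page)
print(decompress_body(page, None) == page)
```

With urllib, the second argument would come from the response's Content-Encoding header, and the returned bytes can then be decoded as UTF-8 as before.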


