python requests.get() returns improperly decoded text instead of UTF-8?

When the content type sent by the server is 'Content-Type:text/html', requests.get() returns improperly decoded data.

However, if the content type is given explicitly as 'Content-Type:text/html; charset=utf-8', it returns properly decoded data.

Also, when we use urllib.urlopen(), it returns properly encoded data.

Has anyone noticed this before? Why does requests.get() behave like this?
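The symptom described in the question can be reproduced without any network access. The snippet below is only an illustration (the sample string is made up): it takes UTF-8 bytes and decodes them once as ISO-8859-1 and once as UTF-8.

```python
# The same UTF-8 bytes decoded two ways; no network access needed.
original = "café"
raw = original.encode("utf-8")          # b'caf\xc3\xa9'

as_latin1 = raw.decode("iso-8859-1")    # each byte becomes one character
as_utf8 = raw.decode("utf-8")           # multi-byte sequences are combined

print(as_latin1)  # cafÃ©
print(as_utf8)    # café
```

Decoding as ISO-8859-1 never raises an error, because every byte value maps to some character, so the wrong guess shows up as garbled text rather than as an exception.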

The "educated guesses" mentioned in the requests documentation are probably just a check of the Content-Type header sent by the server (a quite misleading use of "educated", IMHO).

For the response header Content-Type: text/html the result is ISO-8859-1 (the default for HTML4), regardless of any content analysis (i.e. even though the default for HTML5 is UTF-8).

For the response header Content-Type: text/html; charset=utf-8 the result is UTF-8.

Luckily for us, requests uses the chardet library, and that usually works quite well (attribute requests.Response.apparent_encoding), so you usually want to do:
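The header-only rule described above can be sketched with the standard library. This is not the code requests runs; `guess_encoding_from_content_type` is a hypothetical helper that only imitates the behaviour described in this answer: use the charset parameter if present, otherwise fall back to ISO-8859-1 for text types.

```python
from email.message import Message


def guess_encoding_from_content_type(content_type):
    """Hypothetical helper: guess an encoding from a Content-Type value alone."""
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_content_charset()  # None when there is no charset parameter
    if charset:
        return charset
    if msg.get_content_maintype() == "text":
        # Header-only fallback for text/* types, as described in the answer.
        return "ISO-8859-1"
    return None


print(guess_encoding_from_content_type("text/html"))
print(guess_encoding_from_content_type("text/html; charset=utf-8"))
```

Note that the body of the response is never inspected here, which is why a UTF-8 page served without a charset parameter ends up decoded as ISO-8859-1.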

import requests

r = requests.get("https://martin.slouf.name/")
# override the header-based encoding with the guess chardet makes from the content
r.encoding = r.apparent_encoding
# access the decoded data
r.text
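A slightly more cautious variant of the same idea is sketched below: keep the server's declared charset when there is one, and fall back to the detected encoding only when the header has no charset parameter. `text_with_detected_fallback` is a hypothetical helper name; it relies only on the `headers`, `encoding`, `apparent_encoding` and `text` attributes of the response object.

```python
def text_with_detected_fallback(response):
    """Return response.text, trusting content detection only when no charset is declared."""
    content_type = response.headers.get("Content-Type", "")
    if "charset" not in content_type.lower():
        # No explicit charset: replace the header-based default with the detected encoding.
        response.encoding = response.apparent_encoding
    return response.text
```

It would be called as `text_with_detected_fallback(requests.get(url))`, where `url` is whatever page you are fetching. Detection is a statistical guess, so an explicit charset from the server is usually the safer source when it exists.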

From the requests documentation:

When you make a request, Requests makes educated guesses about the encoding of the response based on the HTTP headers. The text encoding guessed by Requests is used when you access r.text. You can find out what encoding Requests is using, and change it, using the r.encoding property.

>>> r.encoding
'utf-8'
>>> r.encoding = 'ISO-8859-1'

Check the encoding requests used for your page, and if it's not the right one, force it to be the one you need.

Regarding the differences between requests and urllib.urlopen: they probably use different ways to guess the encoding. That's all.

The default assumed content encoding for text/html is ISO-8859-1, aka Latin-1 :( See RFC-2854. UTF-8 was too young to become the default: it was born in 1993, about the same time as HTML and HTTP.

Use .content to access the byte stream, or .text to access the decoded Unicode stream. If the HTTP server does not bother to declare the correct encoding, the value of .text may be off.

After getting the response, use response.content instead of response.text. It holds the raw bytes exactly as the server sent them (UTF-8 in this case), so no wrong decoding is applied to them.

import requests

# download_link, myUsername and myPassword are placeholders to be defined by the caller
response = requests.get(download_link, auth=(myUsername, myPassword), headers={'User-Agent': 'Mozilla'})
print(response.encoding)
if response.status_code == 200:
    body = response.content  # raw, undecoded bytes
else:
    print("Unable to get response with Code : %d " % (response.status_code))
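Since response.content is just bytes, the decoding step is then up to you. A minimal sketch, assuming the page is most likely UTF-8 and that ISO-8859-1 is an acceptable last resort; `decode_body` and its parameters are hypothetical names, not part of requests:

```python
def decode_body(raw, preferred="utf-8", fallback="iso-8859-1"):
    """Decode raw bytes with the preferred encoding, falling back if they are not valid."""
    try:
        return raw.decode(preferred)
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte to a character, so this branch cannot fail.
        return raw.decode(fallback)


print(decode_body("café".encode("utf-8")))       # café
print(decode_body("café".encode("iso-8859-1")))  # café
```

With the snippet above this would be used as `decode_body(body)`. Trying UTF-8 first works well because bytes in another single-byte encoding rarely happen to form valid UTF-8.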

