python requests fails to decode utf-8 API response

I'm trying to get the response JSON from the following API endpoint: https://datos.madrid.es/egob/catalogo/205026-0-cementerios.json. My code is:

import requests

url = 'https://datos.madrid.es/egob/catalogo/205026-0-cementerios.json'
r = requests.get(url)
r.json()

It fails with the error:

json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

If I check the encoding on the response, it's empty. So I tried to force the encoding before accessing it, with no success:

import requests

url = 'https://datos.madrid.es/egob/catalogo/205026-0-cementerios.json'
r = requests.get(url)
r.encoding = 'utf-8'
r.json()

This gives the same error.

r.text

returns something like:

'\x00\x00\x01\x00\x01\x00  \x00\x00\x01\x00\x18\x0 .......

so it looks like the response is not being decoded properly.

How can I get it decoded successfully?

It seems to be gzip-compressed. Decompress it and then parse it with json.loads. The encoding is utf-8.

Example:

import zlib

# f is assumed to be a file-like object holding the raw gzip-compressed body
decompressed_data = zlib.decompress(f.read(), 16 + zlib.MAX_WBITS)
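
As a minimal, network-free sketch of the idea above: the snippet below gzips a small made-up JSON payload (the `sample` dict is purely illustrative, not the real API response), then decompresses it with the same `16 + zlib.MAX_WBITS` setting and parses it with `json.loads`.

```python
import gzip
import json
import zlib

# Hypothetical stand-in for the API payload; the real response is much larger.
sample = {"title": "Cementerio de ejemplo", "locality": "MADRID"}

# Simulate what a server sends with "Content-Encoding: gzip".
compressed = gzip.compress(json.dumps(sample).encode("utf-8"))

# 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
decompressed = zlib.decompress(compressed, 16 + zlib.MAX_WBITS)

# The body is UTF-8 encoded JSON, so decode and parse it.
parsed = json.loads(decompressed.decode("utf-8"))
print(parsed["locality"])
```

Note that requests normally undoes a gzip Content-Encoding on its own, so manual decompression is mainly relevant when you read the raw bytes yourself.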

Your URL is public, so you can test it with your favorite browser. Chrome gives the following headers:

Cache-Control: no-cache
Connection: Keep-Alive
Content-disposition: inline;filename=205026-0-cementerios.json
Content-Encoding: gzip
Content-Length: 4383
Content-Type: application/json;charset=UTF-8
Date: Thu, 20 Dec 2018 12:19:33 GMT
OT-force-Account-Verify: true
Vary: Accept-Encoding
X-Frame-Options: SAMEORIGIN
X-UA-Compatible: IE=8
Xonnection: close

And after unzipping, it looks like valid JSON:

{
"@context": {
    "c": "http://www.w3.org/2002/12/cal#",
    "dcterms": "http://purl.org/dc/terms/",
    "geo": "http://www.w3.org/2003/01/geo/wgs84_pos#",
    "loc": "http://purl.org/ctic/infraestructuras/localizacion#",
    "org": "http://purl.org/ctic/infraestructuras/organizacion#",
    "vcard": "http://www.w3.org/2006/vcard/ns#",
    "title": "vcard:fn",
    "id": "dcterms:identifier",
    "relation": "dcterms:relation",
    "references": "dcterms:references",
    "address": "vcard:adr",
    "area": "loc:barrio",
    "district": "loc:distrito",
    "locality": "vcard:locality",
    "postal-code": "vcard:postal-code",
    "street": "vcard:street-address",
    "location": "vcard:geo",
    "latitude": "geo:lat",
    "longitude": "geo:long",
....

The server is doing something funky with the User-Agent header (namely, returning the favicon if it's not recognised!). You can work around this by forcing the user agent:

import requests

url = 'https://datos.madrid.es/egob/catalogo/205026-0-cementerios.json'
# Send a User-Agent the server recognises so it returns the JSON, not the favicon
r = requests.get(url, headers={"User-Agent": "curl/7.61.0"})
print(r.json())
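
The bytes shown in the question (`\x00\x00\x01\x00...`) match the signature of an ICO image, which fits the favicon explanation. A small guard like the following can make that failure mode explicit before calling `r.json()`. It is only a sketch: `looks_like_ico` and `ICO_MAGIC` are hypothetical names, not part of requests.

```python
# First four bytes of an ICO file: reserved 0x0000, then image type 0x0001 (icon).
ICO_MAGIC = b"\x00\x00\x01\x00"


def looks_like_ico(body: bytes) -> bool:
    """Return True if the payload starts with the ICO file signature."""
    return body.startswith(ICO_MAGIC)


# Bytes like those in the question versus a JSON body.
print(looks_like_ico(b"\x00\x00\x01\x00\x01\x00  \x00\x00\x01\x00\x18"))
print(looks_like_ico(b'{"@context": {}}'))
```

With a real response you would pass `r.content` and, if the check returns `True`, retry with a different User-Agent instead of parsing.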

If you set your own headers and include "Accept-Encoding": "gzip, deflate, br", install the brotli library with pip install brotli. You don't need to import brotli in your .py file.
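
One way to avoid advertising an encoding you cannot decode is to add `br` only when a Brotli decoder is importable. The sketch below assumes the `brotli` package is the decoder in question; `accept_encoding` is a hypothetical helper name, and the User-Agent value is the one from the answer above.

```python
import importlib.util


def accept_encoding() -> str:
    """Build an Accept-Encoding value, advertising br only if brotli is importable."""
    encodings = ["gzip", "deflate"]
    # Assumption: the `brotli` package is what provides Brotli decoding here.
    if importlib.util.find_spec("brotli") is not None:
        encodings.append("br")
    return ", ".join(encodings)


headers = {"User-Agent": "curl/7.61.0", "Accept-Encoding": accept_encoding()}
print(headers["Accept-Encoding"])
```

The resulting `headers` dict can be passed to `requests.get(url, headers=headers)`.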
