pycurl and curl behaving differently when requesting the same resource; curl correctly gives a JSON object, PycURL an HTML object

ipinfo.io provides information about the website or server behind an IP address. You can look an address up on their website or send them a request with the curl command-line utility, e.g.:

$ curl https://ipinfo.io/172.217.169.6

outputs, in JSON format:

{
  "ip": "172.217.169.68",
  "hostname": "lhr48s09-in-f4.1e100.net",
  "city": "London",
  "region": "England",
  "country": "GB",
  "loc": "51.5085,-0.1257",
  "org": "AS15169 Google LLC",
  "postal": "EC1A",
  "timezone": "Europe/London",
  "readme": "https://ipinfo.io/missingauth"
}

What I eventually want to do is make this request in Python and store the result as a JSON object. I believe the following code, using PycURL, should produce the same output:

import pycurl
from io import BytesIO

buffer = BytesIO()
c = pycurl.Curl()
c.setopt(c.URL, "https://ipinfo.io/172.217.169.6")
c.setopt(c.WRITEDATA, buffer)
c.perform()
c.close()

body = buffer.getvalue()
print(body.decode('iso-8859-1'))

i.e., write the same JSON string into the buffer.

However, it instead prints a large HTML document, which I suspect is the HTML of the web page at that URL rather than the JSON data, e.g.:

<!DOCTYPE html>
<html>
<head>
    <title>
    172.217.169.6 IP Address Details
 - IPinfo.io</title>
    <meta charset="utf-8">
    <meta name="apple-itunes-app" content="app-id=917634022">
    <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no, user-scalable=no">
    <meta name="description" content="Full IP address details for 172.217.169.6 (AS15169 Google LLC) including geolocation and map, hostname, and API details.">

    <link rel="manifest" href="/static/manifest.json">
    <link rel="icon" sizes="48x48" href="/static/deviceicons/android-icon-48x48.png">


...
    

</html>

Basically, how can I get PycURL to receive this JSON data as well?

I tried comparing the verbose output of both and couldn't figure out why they behave differently. The only difference I could see in the responses is the content-type field: "application/json" for curl and "text/html" for PycURL, which explains the different outputs. At the risk of making this post extremely long-winded, I've provided both below:

curl (command line) verbose output:

$ curl -v https://ipinfo.io/172.217.169.6
*   Trying 34.117.59.81:443...
* TCP_NODELAY set
* Connected to ipinfo.io (34.117.59.81) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/certs/ca-certificates.crt
  CApath: /etc/ssl/certs
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* ALPN, server accepted to use h2
* Server certificate:
*  subject: CN=ipinfo.io
*  start date: Jul 10 20:18:59 2021 GMT
*  expire date: Oct  8 21:18:59 2021 GMT
*  subjectAltName: host "ipinfo.io" matched cert's "ipinfo.io"
*  issuer: C=US; O=Google Trust Services LLC; CN=GTS CA 1D4
*  SSL certificate verify ok.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Using Stream ID: 1 (easy handle 0x55a887a40e10)
> GET /172.217.169.6 HTTP/2
> Host: ipinfo.io
> user-agent: curl/7.68.0
> accept: */*
> 
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* old SSL session ID is stale, removing
* Connection state changed (MAX_CONCURRENT_STREAMS == 100)!
< HTTP/2 200 
< access-control-allow-origin: *
< x-frame-options: DENY
< x-xss-protection: 1; mode=block
< x-content-type-options: nosniff
< referrer-policy: strict-origin-when-cross-origin
< content-type: application/json; charset=utf-8
< content-length: 286
< date: Tue, 27 Jul 2021 21:03:50 GMT
< x-envoy-upstream-service-time: 1
< via: 1.1 google
< alt-svc: clear
< 
{
  "ip": "172.217.169.6",
  "hostname": "lhr25s26-in-f6.1e100.net",
  "city": "London",
  "region": "England",
  "country": "GB",
  "loc": "51.5085,-0.1257",
  "org": "AS15169 Google LLC",
  "postal": "EC1A",
  "timezone": "Europe/London",
  "readme": "https://ipinfo.io/missingauth"
* Connection #0 to host ipinfo.io left intact
}

PycURL verbose output:

$ python3 ip_helper.py
*   Trying 34.117.59.81:443...
* TCP_NODELAY set
* Connected to ipinfo.io (34.117.59.81) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/certs/ca-certificates.crt
  CApath: /etc/ssl/certs
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* ALPN, server accepted to use h2
* Server certificate:
*  subject: CN=ipinfo.io
*  start date: Jul 10 20:18:59 2021 GMT
*  expire date: Oct  8 21:18:59 2021 GMT
*  subjectAltName: host "ipinfo.io" matched cert's "ipinfo.io"
*  issuer: C=US; O=Google Trust Services LLC; CN=GTS CA 1D4
*  SSL certificate verify ok.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Using Stream ID: 1 (easy handle 0x19d65c0)
> GET /172.217.169.6 HTTP/2
Host: ipinfo.io
user-agent: PycURL/7.43.0.6 libcurl/7.68.0 OpenSSL/1.1.1f zlib/1.2.11 brotli/1.0.7 libidn2/2.2.0 libpsl/0.21.0 (+libidn2/2.2.0) libssh/0.9.3/openssl/zlib nghttp2/1.40.0 librtmp/2.3
accept: */*

* old SSL session ID is stale, removing
* Connection state changed (MAX_CONCURRENT_STREAMS == 100)!
< HTTP/2 200 
< access-control-allow-origin: *
< x-frame-options: DENY
< x-xss-protection: 1; mode=block
< x-content-type-options: nosniff
< referrer-policy: strict-origin-when-cross-origin
< content-type: text/html; charset=utf-8
< content-length: 44645
< date: Tue, 27 Jul 2021 21:07:50 GMT
< x-envoy-upstream-service-time: 13
< via: 1.1 google
< alt-svc: clear
< 
* Connection #0 to host ipinfo.io left intact
<!DOCTYPE html>
<html>
<head>
    <title>
    172.217.169.6 IP Address Details
 - IPinfo.io</title>
    <meta charset="utf-8">
    <meta name="apple-itunes-app" content="app-id=917634022">
    <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no, user-scalable=no">
    <meta name="description" content="
    
        Full IP address details for 172.217.169.6 (AS15169 Google LLC) including geolocation and map, hostname, and API details.
    
">

    <link rel="manifest" href="/static/manifest.json">
    <link rel="icon" sizes="48x48" href="/static/deviceicons/android-icon-48x48.png">


...

</html>

Thank you for your time.

From the docs:

We try to automatically detect when someone wants to call our API versus view our website, and then we send back the appropriate JSON response rather than HTML. We do this based on the user agent for known popular programming languages, tools, and frameworks. However, there are a couple of other ways to force a JSON response when it doesn't happen automatically. One is to add /json to the URL, and the other is to set an Accept header to application/json
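
To make the quoted behaviour concrete, here is a rough sketch of the kind of decision a server could make. It is purely illustrative: the function name `wants_json` and the `KNOWN_API_CLIENTS` list are assumptions made up for this example, not ipinfo.io's actual implementation.

```python
from typing import Mapping

# Assumed list of User-Agent prefixes treated as API clients (hypothetical).
KNOWN_API_CLIENTS = ("curl", "python-requests", "wget")


def wants_json(path: str, headers: Mapping[str, str]) -> bool:
    """Guess whether a request should get JSON instead of HTML.

    Hypothetical rule that follows the three hints in the quoted docs.
    `headers` is expected to use lowercase header names.
    """
    # Explicit opt-in 1: the URL path ends in /json.
    if path.rstrip("/").endswith("/json"):
        return True
    # Explicit opt-in 2: the Accept header names application/json.
    if "application/json" in headers.get("accept", "").lower():
        return True
    # Fallback: guess from the start of the User-Agent string.
    agent = headers.get("user-agent", "").lower()
    return agent.startswith(KNOWN_API_CLIENTS)
```

Under this assumed rule, a user agent beginning with "curl/7.68.0" is treated as an API client, while one beginning with "PycURL/7.43.0.6" is not, which would be consistent with the two verbose logs in the question.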

So it looks like there are three different ways to get JSON back using pycurl.

  1. Append /json to your URL:
c.setopt(c.URL, "https://ipinfo.io/172.217.169.6/json")
  2. Set your Accept header to only allow JSON responses:
c.setopt(c.HTTPHEADER, ["Accept: application/json"])
  3. Set your User-Agent header to make the website think it's talking to curl instead of pycurl:
c.setopt(c.HTTPHEADER, ["User-Agent: curl"])
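
Putting this together with the code from the question, a minimal sketch using the Accept header might look like the following. The helper names `parse_body` and `fetch_ip_info` are made up for this example, and the body is decoded as UTF-8 on the assumption that the response keeps the `charset=utf-8` shown in the verbose output.

```python
import json
from io import BytesIO


def parse_body(raw: bytes) -> dict:
    """Decode a UTF-8 JSON response body into a Python dict."""
    return json.loads(raw.decode("utf-8"))


def fetch_ip_info(ip: str) -> dict:
    """Request https://ipinfo.io/<ip> and explicitly ask for JSON."""
    # Imported here so parse_body stays usable without pycurl installed.
    import pycurl

    buffer = BytesIO()
    c = pycurl.Curl()
    try:
        c.setopt(c.URL, f"https://ipinfo.io/{ip}")
        # Ask for JSON regardless of what the User-Agent looks like.
        c.setopt(c.HTTPHEADER, ["Accept: application/json"])
        c.setopt(c.WRITEDATA, buffer)
        c.perform()
    finally:
        # Release the handle even if perform() raises.
        c.close()
    return parse_body(buffer.getvalue())
```

Calling `fetch_ip_info("172.217.169.6")` should then give a dict with keys such as "ip" and "org", provided the service responds as it does in the question. Note that each `c.setopt(c.HTTPHEADER, ...)` call replaces the previous header list, so if you want to send both an Accept and a User-Agent header, pass them together in one list.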
