
How to extract XHR response data from a website?

I want to get a link to a kind of JSON document that some webpages download after being loaded. For instance, on this webpage:


But it can be a very different document on a different webpage. Unfortunately, I can't find the link in the page source with Beautiful Soup.

So far I have tried this:

import requests
import json

data = {
  "Device[udid]": "",
  "API_KEY": "",
  "API_SECRET": "",
  "Device[change]": "",
  "fbToken": ""
}

headers = {
  "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36"
}

url = "https://data.electionsportal.ge/en/event_type/1/event/38/shape/69898/shape_type/1?data_type=official"

r = requests.post(url, data=data, headers=headers)
data = r.json()

But it returns a JSON decode error:

---------------------------------------------------------------------------
JSONDecodeError                           Traceback (most recent call last)
<ipython-input-72-189954289109> in <module>
     17 
     18 r = requests.post(url, data=data, headers=headers)
---> 19 data = r.json()
     20 

C:\ProgramData\Anaconda3\lib\site-packages\requests\models.py in json(self, **kwargs)
    895                     # used.
    896                     pass
--> 897         return complexjson.loads(self.text, **kwargs)
    898 
    899     @property

C:\ProgramData\Anaconda3\lib\json\__init__.py in loads(s, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
    346             parse_int is None and parse_float is None and
    347             parse_constant is None and object_pairs_hook is None and not kw):
--> 348         return _default_decoder.decode(s)
    349     if cls is None:
    350         cls = JSONDecoder

C:\ProgramData\Anaconda3\lib\json\decoder.py in decode(self, s, _w)
    335 
    336         """
--> 337         obj, end = self.raw_decode(s, idx=_w(s, 0).end())
    338         end = _w(s, end).end()
    339         if end != len(s):

C:\ProgramData\Anaconda3\lib\json\decoder.py in raw_decode(self, s, idx)
    353             obj, end = self.scan_once(s, idx)
    354         except StopIteration as err:
--> 355             raise JSONDecodeError("Expecting value", s, err.value) from None
    356         return obj, end

JSONDecodeError: Expecting value: line 1 column 1 (char 0)
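The error itself is the clue: the URL returns an HTML page, so the response body starts with `<` and the JSON decoder fails at line 1 column 1. A minimal sketch of guarding against this before parsing (the helper name `safe_json` is my own, not from the question):

```python
import json

def safe_json(text, content_type):
    """Return parsed JSON, or None if the body is not JSON (e.g. an HTML page)."""
    if "json" not in content_type:
        # HTML pages are served as text/html, real JSON as application/json
        return None
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None

# An HTML body, like the one the question's URL actually returns, is rejected:
print(safe_json("<!DOCTYPE html><html>...</html>", "text/html"))   # None
# A genuine JSON body parses fine:
print(safe_json('{"votes": 123}', "application/json"))             # {'votes': 123}
```

With `requests`, the same check is `"json" in r.headers.get("Content-Type", "")` before calling `r.json()`.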

The JSON you are trying to find in the HTML content is loaded by the client through JavaScript with XMLHttpRequests. That means you will not be able to use BeautifulSoup to find a tag in the HTML that contains the URL: it is either embedded in a <script> block or loaded externally.
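When the URL is embedded in a `<script>` block, a regular-expression search over the raw HTML can still pull it out even though BeautifulSoup sees no matching tag. A sketch, using a made-up script snippet (the variable name and the `.json` path below are hypothetical, for illustration only):

```python
import re

# Hypothetical HTML fragment standing in for the real page source.
html = """
<script>
  var json_url = "/data/json/38_1_69898_0.json";
  loadResults(json_url);
</script>
"""

# Find anything inside quotes that looks like an absolute path to a .json file.
matches = re.findall(r'["\'](/[^"\']+\.json)["\']', html)
print(matches)   # ['/data/json/38_1_69898_0.json']
```

For the real page you would run the same search over `r.text` and prepend the site's base URL to each match.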

Besides, you are trying to convert a webpage written in HTML into JSON, and attempting to access a key (coins) which is defined nowhere inside the webpage or the JSON content.

Solution

  1. Load that JSON directly, without attempting to find the JSON URL with BeautifulSoup on the aforementioned website. By doing so, you would then be able to run r.json() flawlessly.

  2. Otherwise, check out Selenium, a web driver that allows you to run JavaScript.

Hope that clears it out.

This works for both links in your post:

from bs4 import BeautifulSoup
import requests

url = 'https://data.electionsportal.ge/en/event_type/1/event/38/shape/69898/shape_type/1?data_type=official'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')

# The first <script> block assigns the relative JSON paths to variables,
# so split its text on ';' and keep the right-hand side of each assignment.
splits = [item.split('=', 1)[-1] for item in str(soup.script).split(';')]
filtered_splits = [item.replace('"', '') for item in splits if 'json' in item and 'xxx' not in item]
links_to_jsons = ["https://data.electionsportal.ge" + item for item in filtered_splits]

for item in links_to_jsons:
    r = requests.get(item)
    print(r.json())       # change as you want

Btw, I am guessing that you can construct the JSON links by changing the number 69898 to the number in the same position on another webpage (but still on data.electionsportal.ge).
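That guess can be sketched as a simple URL template: keep everything fixed except the shape number from the question's URL (69898 is real; 69899 below is a made-up neighbouring id, purely for illustration):

```python
# Template for the page URL, parameterised on the shape number.
BASE = ("https://data.electionsportal.ge/en/event_type/1/event/38"
        "/shape/{shape_id}/shape_type/1?data_type=official")

def page_url(shape_id):
    """Build the election-results page URL for a given shape id."""
    return BASE.format(shape_id=shape_id)

print(page_url(69898))   # the URL from the question
print(page_url(69899))   # a hypothetical neighbouring shape
```

Each constructed page URL could then be fed through the same script-scraping loop above to collect its JSON links.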
