
How to grab from JSON in Selenium Python

My page returns a JSON HTTP response that contains id: 14.

Is there a way in Selenium Python to grab this? I searched the web and could not find any solutions. Now I am wondering whether it is simply not possible. I could grab this id from the db, but I am trying to avoid that. Please tell me if there is any way around this. Thank you.

The source of your difficulty is that when a browser receives raw JSON data, it wraps the data in a tiny bit of HTML to make it visible to the user on the screen.

When I visit https://httpbin.org/user-agent in Firefox, for example, the following raw JSON appears in my browser window:

{"user-agent": "Mozilla/5.0 (X11; Linux x86_64; rv:42.0) Gecko/20100101 Firefox/42.0"
}

But in fact Firefox (and Chrome) has wrapped the JSON in a bit of extra HTML in order to create a document it can actually display. Here is the HTML that Firefox wraps it in, which I can see right in the JavaScript console by evaluating the expression document.documentElement.innerHTML:

<head><link rel="alternate stylesheet" type="text/css"
 href="resource://gre-resources/plaintext.css" title="Wrap Long Lines"></head>
 <body><pre>{"user-agent": "Mozilla/5.0 (X11; Linux x86_64; rv:42.0)
 Gecko/20100101 Firefox/42.0"
}
</pre></body>
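To make the effect of that wrapper concrete, here is a minimal sketch using only the Python standard library. The page source is an abbreviated, hand-written stand-in for the markup above (the user-agent string is shortened), and TextCollector is a hypothetical helper name. It shows that feeding the whole page source to json.loads fails, while the text content alone parses cleanly:

```python
import json
from html.parser import HTMLParser

# Abbreviated stand-in for the page source a browser builds around raw JSON.
page_source = (
    '<html><head></head><body><pre>'
    '{"user-agent": "Mozilla/5.0"}\n'
    '</pre></body></html>'
)


class TextCollector(HTMLParser):
    """Hypothetical helper: keep only text nodes, dropping every tag."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


# Parsing the whole page source as JSON fails because of the wrapper tags.
try:
    json.loads(page_source)
    raw_parse_ok = True
except json.JSONDecodeError:
    raw_parse_ok = False

# Parsing only the text content succeeds.
collector = TextCollector()
collector.feed(page_source)
collector.close()
parsed = json.loads("".join(collector.chunks))
```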

Using BeautifulSoup to parse the HTML, as suggested in another answer, has two serious disadvantages: it introduces a new dependency to your project, and it will also be quite slow compared to taking advantage of the fact that the browser has already parsed the HTML for you and has the resulting DOM ready for your use.

To ask the browser to extract the JSON for you, simply ask it for the text inside of the <body> element. All of the extra structure that the browser has added will be excluded, and the pure JSON will be returned:

from selenium.webdriver.common.by import By
driver.find_element(By.TAG_NAME, 'body').text

Or, if you want it parsed into a Python data structure:

import json
from selenium.webdriver.common.by import By
json.loads(driver.find_element(By.TAG_NAME, 'body').text)
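For the id: 14 case from the question, the last step might look like the sketch below. The string body_text stands in for the text of the <body> element that a live driver would return, and grab_json_field is a hypothetical helper name, not part of Selenium:

```python
import json


def grab_json_field(body_text, key):
    """Parse the <body> text as JSON and return a single field from it."""
    data = json.loads(body_text)
    return data[key]


# Stand-in for the <body> text that a live Selenium driver would return.
body_text = '{"id": 14}'
record_id = grab_json_field(body_text, "id")
print(record_id)
```

If the response is not valid JSON (an HTML error page, for instance), json.loads raises json.JSONDecodeError, so a test may want to catch that and report the page text.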

You can use BeautifulSoup to parse the page and extract the JSON. The code you need should look something like this. You may need to change the soup.find command if the JSON isn't directly in the body of the response.

from bs4 import BeautifulSoup
import json

soup = BeautifulSoup(driver.page_source, "html.parser")
dict_from_json = json.loads(soup.find("body").text)

The other solutions didn't work for me. I found this solution using requests to be fast and simple:

import requests
requests.get(browser.current_url).json()
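One caveat with this approach: requests.get issues a second HTTP request outside the browser, so it does not carry the browser's cookies. If the page depends on a logged-in session, one possible workaround is to forward the cookies by hand. This is a sketch under that assumption: cookie_header is a hypothetical helper, and the sample cookies are made up, in the shape that get_cookies() returns.

```python
def cookie_header(selenium_cookies):
    """Build a Cookie header value from the list of dicts returned by
    driver.get_cookies() (each dict has 'name' and 'value' keys)."""
    return "; ".join(f"{c['name']}={c['value']}" for c in selenium_cookies)


# Made-up sample in the shape returned by driver.get_cookies().
sample_cookies = [
    {"name": "sessionid", "value": "abc123"},
    {"name": "csrftoken", "value": "xyz"},
]
header_value = cookie_header(sample_cookies)

# With a live browser, the call might then look like:
# requests.get(browser.current_url,
#              headers={"Cookie": cookie_header(browser.get_cookies())}).json()
```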
