Currently I have Selenium hooked up to Python to scrape a webpage. I found out that the page actually pulls its data from a JSON API, and I can get a JSON response as long as I'm logged in to the page.
However, my approach to getting that response into Python seems a bit junky: I select the text enclosed in <pre> tags and use Python's json package to parse the data, like so:
import json
from selenium import webdriver
from selenium.webdriver.common.by import By

url = 'http://jsonplaceholder.typicode.com/posts/1'
driver = webdriver.Chrome()
driver.get(url)
# Chrome wraps a raw JSON response in a <pre> element
json_text = driver.find_element(By.CSS_SELECTOR, 'pre').get_attribute('innerText')
json_response = json.loads(json_text)
The only reason I need to select within <pre> tags at all is that when JSON appears in Chrome, it comes formatted like this:
<html>
<head></head>
<body>
<pre style="word-wrap: break-word; white-space: pre-wrap;">{
"userId": 1,
"id": 1,
"title": "sunt aut facere repellat provident occaecati excepturi optio reprehenderit",
"body": "quia et suscipit\nsuscipit recusandae consequuntur expedita et cum\nreprehenderit molestiae ut ut quas totam\nnostrum rerum est autem sunt rem eveniet architecto"
}</pre>
</body>
</html>
And the only reason I need to do this inside Selenium at all is that I need to be logged in to the website to get a response. Otherwise I get a 401 and no data.
You can find the pre element and get its text, then load it via json.loads():
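import json
from selenium.webdriver.common.by import By

# Read the text of the <pre> element and parse it as JSON
pre = driver.find_element(By.TAG_NAME, "pre").text
data = json.loads(pre)
print(data)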
If this does not work as-is, try prepending view-source: to your URL, as suggested by @Skandix in the comments.
Alternatively, you can avoid using Selenium to fetch the JSON itself: transfer the cookies from Selenium to requests so that the requests session stays logged in.
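A minimal sketch of that cookie transfer, assuming the site authenticates purely through cookies (some sites also check headers such as User-Agent or a CSRF token, which this does not copy). The helper names copy_cookies and fetch_json are hypothetical, made up for illustration; driver is the already logged-in Selenium driver.

```python
def copy_cookies(driver, session):
    """Copy every cookie held by a Selenium driver into a requests session."""
    # driver.get_cookies() returns a list of dicts with at least
    # "name" and "value"; "domain" and "path" are usually present too.
    for cookie in driver.get_cookies():
        session.cookies.set(
            cookie["name"],
            cookie["value"],
            domain=cookie.get("domain", ""),
            path=cookie.get("path", "/"),
        )
    return session


def fetch_json(driver, url):
    """Fetch url with requests, reusing the login cookies held by driver."""
    # Third-party package; imported here so copy_cookies has no hard dependency.
    import requests

    session = copy_cookies(driver, requests.Session())
    response = session.get(url)
    response.raise_for_status()  # raises on a 401 if the cookies were not enough
    return response.json()
```

After logging in through Selenium once, something like data = fetch_json(driver, url) should return the parsed JSON directly, with no <pre> element involved.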