Simulating a click on a "show more" button with Python requests

Question

I am not sure what code to use to click the "show more" button. I want to get a list of universities that are working on a certain topic. Below is one of the websites:

http://www.sciencedirect.com/science/article/

Your help will be truly appreciated.

Thanks

Answer 1

You shouldn't have to simulate an actual "click" of the "show more" button in Python to scrape this page.

"Show more" buttons on websites are usually tied to some JavaScript that either reveals a hidden element already in the HTML (see Bootstrap's collapse class for a typical example) or fires off a request to some web service (e.g. a REST API) for information to insert into the DOM.

Either way, you can scrape that data. For the former, find the hidden element in the DOM (view the page's source [Ctrl + U] and search the HTML [Ctrl + F]), and use your typical web-scraping tools. For the latter, use something like the Network tab of Chrome DevTools to inspect the API request that is sent when you click "show more", and then try to replicate that request with Python.
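As a rough illustration of the first case (content already in the HTML but hidden), here is a minimal sketch that uses only the standard library. The sample markup, the id "more", and the helper names are all made up for the example; a real page will have its own structure, and a library such as BeautifulSoup would usually be more convenient.

```python
from html.parser import HTMLParser

# Hypothetical page fragment: the "show more" content is already in the HTML,
# merely hidden by a CSS class until the button is clicked.
SAMPLE_HTML = """
<div id="authors">
  <p>Jane Doe</p>
  <button data-toggle="collapse" data-target="#more">Show more</button>
  <div id="more" class="collapse">
    <p>Example University</p>
    <p>Sample Institute of Technology</p>
  </div>
</div>
"""


class HiddenTextParser(HTMLParser):
    """Collect the text found inside the element whose id is target_id.

    Simplification: assumes every tag inside the target has a closing tag
    (no bare <br> or <img>), otherwise the depth counter drifts.
    """

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0   # > 0 while the parser is inside the target element
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.texts.append(data.strip())


def extract_hidden_text(html, target_id):
    """Return the non-empty text nodes inside the element with this id."""
    parser = HiddenTextParser(target_id)
    parser.feed(html)
    return parser.texts


if __name__ == "__main__":
    print(extract_hidden_text(SAMPLE_HTML, "more"))
```

The point is that no click is needed: the hidden block is ordinary HTML, so whatever you fetched with a plain HTTP request already contains it.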

In the specific example you've given, it appears the data you want is stored in an HTML <script> tag as a JSON object. Search the HTML for the word "affiliation".
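A sketch of how that could look, assuming the JSON sits in a <script type="application/json"> tag and that the values of interest are stored under a key named "affiliation". Both are assumptions for illustration; the sample HTML below is invented and the real page's layout may differ, so check the page source first.

```python
import json
import re

# Hypothetical page source with the data embedded as JSON in a script tag.
SAMPLE_HTML = """
<html><body>
<script type="application/json">
{"authors": [
  {"name": "Jane Doe", "affiliation": "Example University"},
  {"name": "John Roe", "affiliation": "Sample Institute of Technology"}
]}
</script>
</body></html>
"""

# Capture the body of every JSON script tag (assumed type attribute).
SCRIPT_RE = re.compile(
    r'<script[^>]*type="application/json"[^>]*>(.*?)</script>',
    re.DOTALL | re.IGNORECASE,
)


def find_values(node, key):
    """Yield every value stored under `key`, at any nesting depth."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                yield v
            else:
                yield from find_values(v, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_values(item, key)


def extract_affiliations(html):
    """Return all "affiliation" values found in JSON script tags."""
    results = []
    for match in SCRIPT_RE.finditer(html):
        try:
            data = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue  # this script tag did not hold valid JSON; skip it
        results.extend(find_values(data, "affiliation"))
    return results


if __name__ == "__main__":
    print(extract_affiliations(SAMPLE_HTML))
```

With a real page you would pass in the text of the HTTP response instead of SAMPLE_HTML, and adjust the regular expression and key name to whatever the page actually uses.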

Answer 2

You'll have to select a different tool to press a button. One possible solution is Selenium, which can tell the browser to press the button. The following example clicks the "show more" button.

import time

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By


def executeTest():
    global driver
    driver.get('http://www.sciencedirect.com/science/article/pii/S2211926417300024')
    time.sleep(7)  # crude wait for the page's JavaScript to render
    # Selenium 4 removed find_element_by_xpath; use find_element(By.XPATH, ...)
    element = driver.find_element(By.XPATH, '//*[@id="app"]/div/div/div/section/div/div[2]/article/div[2]/button')
    element.click()
    time.sleep(3)  # give the expanded content time to load


def startWebDriver():
    global driver
    options = Options()
    options.add_argument("--disable-infobars")
    # Selenium 4 takes the options via the `options` keyword (formerly chrome_options)
    driver = webdriver.Chrome(options=options)


if __name__ == "__main__":
    startWebDriver()
    executeTest()
    driver.quit()

Answer 3

I just got round a similar problem by reading Michael Crenshaw's answer above. Here's what worked for me:

Load the page you want to scrape.
Open the browser's inspector (Inspect) and select the Network tab.
Now click the "show more" button.

You should now see in the Network tab the exact URL the request is sent to. It's a lot easier if you don't open the Network tab until the page has already loaded. That way the only entry in the tab is the request made when you click "show more".

I then just added a few lines to my code like this:

page_source = response.text
if "Show More" in page_source:
    ...  # more results are available: run the scraping function here

After this check I added my scraping function and had it iterate through the URL structure. There's a good post on how to do that with Scrapy here: https://blog.scrapinghub.com/2016/06/22/scrapy-tips-from-the-pros-june-2016
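A minimal sketch of that loop, under stated assumptions: the URL template with a {page} placeholder, the function names, and the example.com address are all hypothetical, and the real URL should be whatever the Network tab shows. The fetcher is passed in as a parameter so the loop can be tried offline; the default one uses the third-party requests package.

```python
def fetch_with_requests(url):
    """Default fetcher; needs the third-party `requests` package."""
    import requests  # imported lazily so the offline demo runs without it

    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text


def scrape_all_pages(url_template, parse, fetch=fetch_with_requests, max_pages=50):
    """Fetch page 1, 2, 3, ... until a page no longer offers "Show More".

    url_template: string containing a {page} placeholder (hypothetical shape).
    parse: function that turns one page's source into a list of items.
    max_pages: safety limit so a misbehaving site cannot loop forever.
    """
    items = []
    for page in range(1, max_pages + 1):
        page_source = fetch(url_template.format(page=page))
        items.extend(parse(page_source))
        if "Show More" not in page_source:
            break  # no further results are offered, so this was the last page
    return items


if __name__ == "__main__":
    # Offline demonstration with canned pages instead of real HTTP calls.
    fake_site = {
        "https://example.com/results?page=1": "alpha beta Show More",
        "https://example.com/results?page=2": "gamma Show More",
        "https://example.com/results?page=3": "delta",
    }

    def fake_parse(source):
        # Stand-in for a real scraping function: split the page into words.
        return source.replace("Show More", "").split()

    print(scrape_all_pages(
        "https://example.com/results?page={page}",
        fake_parse,
        fetch=fake_site.__getitem__,
    ))
```

For a real site you would drop the fetch argument, pass the URL pattern observed in the Network tab, and supply your own parse function.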

Hope this helps.

Answer 1
4 2018-01-09 04:23:37

Answer 2
2 2018-01-09 02:36:02

Answer 3
0 2020-05-08 11:32:17
