
How to scrape websites with infinite scrolling and a "load more" button using Python and Selenium

I want to scrape Facebook's mbasic.facebook.com interface. It has a "load more" button for moving down to further posts. I have done a lot of research on scraping Facebook's regular interface and found this: Scraping infinite scrolling website with Selenium in Python

import time
import unittest

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys


class Sel(unittest.TestCase):
    def setUp(self):
        self.driver = webdriver.Chrome()
        self.driver.implicitly_wait(30)
        self.verificationErrors = []
        self.accept_next_alert = True

    def test_sel(self):
        driver = self.driver
        delay = 3
        driver.get("https://www.facebook.com")

        # Log in (credentials left blank here)
        elem = driver.find_element(By.NAME, "email")
        elem.clear()
        elem.send_keys("")

        elem2 = driver.find_element(By.NAME, "pass")
        elem2.clear()
        elem2.send_keys("")
        elem2.send_keys(Keys.RETURN)

        # Scroll to the bottom repeatedly so that more posts get loaded
        for i in range(1, 100):
            self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(4)

        # Dump the fully scrolled page
        html_source = driver.page_source
        data = html_source.encode('utf-8')
        print(data)


if __name__ == "__main__":
    unittest.main()

But I don't want to write a loop. Instead, I would like to react to an event: when the user manually presses the "load more posts" button and the new page loads, I get the page source of that page. Is there any way to do that? Any help would be appreciated.

So are you trying to get the page source each time you load more posts? Because that code doesn't reflect that. Assuming you want the source code each time a new list of posts loads, you can locate and click the "More Posts" button using an XPath.

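from time import sleep
from selenium.webdriver.common.by import By

# Assumes `driver` is the logged-in webdriver instance from the question
for i in range(1, 10):
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
    # Click the parent of the first <span> whose text contains "More"
    driver.find_element(By.XPATH, '//span[contains(., "More")]/..').click()
    html_source = driver.page_source
    data = html_source.encode('utf-8')
    print(data)
    sleep(4)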
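
If you would rather not hard-code the loop count, one option is to wrap the same click-then-read step in a generator that hands back the page source once per loaded page and stops when the button is gone. This is only a sketch: the names iter_page_sources and selenium_click_more are made up for illustration, the XPath is the same guess as above, and it assumes driver is already logged in. Note that with implicitly_wait(30) set, the final lookup will wait up to 30 seconds before deciding the button is missing.

```python
from typing import Callable, Iterator


def iter_page_sources(
    get_source: Callable[[], str],
    click_more: Callable[[], bool],
    max_pages: int = 10,
) -> Iterator[str]:
    """Yield the page source once per loaded page, clicking "more" in between.

    click_more must return False once there is no "more" control left.
    """
    for _ in range(max_pages):
        yield get_source()
        if not click_more():
            break


def selenium_click_more(driver, xpath='//span[contains(., "More")]/..') -> bool:
    """Click the first element matching xpath; return False if none exists."""
    # Imported lazily so the generator above stays usable without Selenium.
    from selenium.webdriver.common.by import By

    matches = driver.find_elements(By.XPATH, xpath)  # empty list if absent
    if not matches:
        return False
    matches[0].click()
    return True


# Hypothetical wiring, given a logged-in `driver`:
# for html in iter_page_sources(lambda: driver.page_source,
#                               lambda: selenium_click_more(driver)):
#     print(html.encode('utf-8'))
```

Keeping the generator free of Selenium calls means the consumer decides what to do with each page as it arrives, which is about as close to "give me the source whenever a new page loads" as you get without hooking browser events.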
