
How to scrape websites with infinite scrolling and a "load more" button using Python and Selenium

I want to scrape Facebook's mbasic.facebook.com interface. It has a "load more" button that loads additional posts. I have done a lot of research on scraping Facebook's regular interface and found this question, "Scraping infinite scrolling website with Selenium in Python", which uses the following code:

import time
import unittest

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys


class Sel(unittest.TestCase):
    def setUp(self):
        self.driver = webdriver.Chrome()
        self.driver.implicitly_wait(30)
        self.verificationErrors = []
        self.accept_next_alert = True

    def test_sel(self):
        driver = self.driver
        delay = 3
        driver.get("https://www.facebook.com")

        # Fill in the login form (credentials left blank here).
        elem = driver.find_element(By.NAME, "email")
        elem.clear()
        elem.send_keys("")

        elem2 = driver.find_element(By.NAME, "pass")
        elem2.clear()
        elem2.send_keys("")
        elem2.send_keys(Keys.RETURN)

        # Scroll to the bottom repeatedly so that more posts are loaded.
        for i in range(1, 100):
            self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(4)

        # The page source is read only once, after all the scrolling is done.
        html_source = driver.page_source
        data = html_source.encode('utf-8')
        print(data)


if __name__ == "__main__":
    unittest.main()

But I don't want to run a fixed loop. Instead, I would like to trigger an event: when the user manually presses the "load more posts" button, the new page loads and I get its page source. Is there any way to do that? Any help would be appreciated.

So are you trying to get the page source each time more posts are loaded? The code above doesn't do that: it reads the source only once, after the loop has finished. Assuming you want the source each time a new list of posts loads, you can locate the "More Posts" button with an XPath and click it:

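from time import sleep

from selenium.webdriver.common.by import By

# Assumes `driver` is an existing WebDriver that is already logged in.
for i in range(1, 10):
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
    # Click the parent of the <span> whose text contains "More".
    driver.find_element(By.XPATH, '//span[contains(., "More")]/..').click()
    sleep(4)  # give the new posts time to load before reading the source
    html_source = driver.page_source
    data = html_source.encode('utf-8')
    print(data)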
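
If a fixed iteration count is the main objection, one possible variation is to keep clicking only while the button is still present and to hand back the page source after every click. The following is a minimal sketch, not a tested Facebook scraper: the function name `iter_page_sources` and its parameters `more_xpath`, `max_clicks`, and `pause` are hypothetical, and it assumes the "more" element can be found with the XPath passed in. It uses `find_elements`, which returns an empty list when nothing matches, so the loop ends cleanly once the button disappears.

```python
import time

# Selenium's By.XPATH constant is the plain string "xpath"; the literal is used
# here so the sketch does not need Selenium imported to define the function.
XPATH = "xpath"


def iter_page_sources(driver, more_xpath, max_clicks=50, pause=0.0):
    """Yield the page source, then keep clicking "more" until it is gone.

    `driver` is expected to behave like a Selenium WebDriver, i.e. to provide
    `page_source` and `find_elements(by, value)`.
    """
    yield driver.page_source
    for _ in range(max_clicks):  # upper bound so the loop always terminates
        buttons = driver.find_elements(XPATH, more_xpath)
        if not buttons:
            break  # no "more" button left: everything has been loaded
        buttons[0].click()
        time.sleep(pause)  # crude pause; an explicit wait would be more robust
        yield driver.page_source
```

With a real, logged-in driver this could be consumed as `for html in iter_page_sources(driver, '//span[contains(., "More")]/..', pause=4): print(html.encode('utf-8'))`. Because it is a generator, each page source is available as soon as the corresponding click has been made, rather than only after the whole loop ends.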


 