How to scrape a dynamically loaded website by scrolling with Python Selenium

Question

I want to scrape all the monetary policy reports on this ECB website using Python's Selenium package. Here is my code:

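from typing import List

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait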

CHROME_PATH = <INSERT_CHROME_PATH_HERE>

url = "https://www.ecb.europa.eu/press/govcdec/mopo/html/index.en.html"

# xpath of monetary policy report links (comment kept outside the string so it is not parsed as XPath)
xpath = """//*[@id='snippet*']/dd/div[2]/span/a |
//*[@id='snippet1']/dd/div[2]/span/a |
//*[@id='snippet2']/dd/div[2]/span/a |
//*[@id='snippet3']/dd/div[2]/span/a |
//*[@id='snippet4']/dd/div[2]/span/a |
//*[@id='snippet5']/dd/div[2]/span/a |
//*[@id='snippet6']/dd/div[2]/span/a |
//*[@id='snippet7']/dd/div[2]/span/a |
//*[@id='snippet8']/dd/div[2]/span/a |
//*[@id='snippet9']/dd/div[2]/span/a |
//*[@id='snippet10']/dd/div[2]/span/a |
//*[@id='snippet11']/dd/div[2]/span/a |
//*[@id='snippet12']/dd/div[2]/span/a |
//*[@id='snippet13']/dd/div[2]/span/a |
//*[@id='snippet14']/dd/div[2]/span/a |
//*[@id='snippet15']/dd/div[2]/span/a |
//*[@id='snippet16']/dd/div[2]/span/a |
//*[@id='snippet17']/dd/div[2]/span/a |
//*[@id='snippet18']/dd/div[2]/span/a |
//*[@id='snippet19']/dd/div[2]/span/a |
//*[@id='snippet20']/dd/div[2]/span/a |
//*[@id='snippet21']/dd/div[2]/span/a |
//*[@id='snippet22']/dd/div[2]/span/a 
"""

wait_until_selector = "#snippet22 > dd:nth-child(2) > div.ecb-langSelector > span > a" # css selector of last link on page

def get_tags_by_xpath_on_page(
    driver: webdriver.Chrome, wait_until_selector: str, xpath: str
) -> List[str]:

    driver.maximize_window()
    driver.get(url)
    driver.execute_script(
        "window.scrollTo(0, document.body.scrollHeight);"
    )  # scroll to bottom
    TIMEOUT = 5
    try:
        element_present = EC.presence_of_element_located(
            (By.CSS_SELECTOR, wait_until_selector)
        )
        WebDriverWait(driver, TIMEOUT).until(element_present)
    except TimeoutException:
        print("Timed out waiting for page to load")
    elems = driver.find_elements(By.XPATH, xpath)  # find_elements_by_xpath was removed in Selenium 4
    tags = [elem.get_attribute("href") for elem in elems]
    return tags

with webdriver.Chrome(CHROME_PATH) as driver:
    tags = get_tags_by_xpath_on_page(driver, wait_until_selector, xpath)

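Currently this only captures the links to the 1999 monetary policy reports at the very bottom of the page. How can I fix this code to scrape everything?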

Answer 1

I looked through the JavaScript and HTML requests made after the initial page load, and realized that what you probably want are links like the following:

https://www.ecb.europa.eu/press/govcdec/mopo/2019/html/index_include.en.html
https://www.ecb.europa.eu/press/govcdec/mopo/2018/html/index_include.en.html
https://www.ecb.europa.eu/press/govcdec/mopo/2017/html/index_include.en.html

...

https://www.ecb.europa.eu/press/govcdec/mopo/2012/html/index_include.en.html

The 2020 and 2021 URLs also return results.
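The URL pattern above can be turned into a short sketch that skips Selenium entirely. This is a minimal illustration, not a tested scraper for this site: the helper names (`build_include_urls`, `extract_links`, `fetch_links`) are hypothetical, the year range is whatever you pass in, and whether years before 2012 follow the same pattern is an assumption you would need to check. It uses only the standard library and collects every `href` found in each per-year fragment.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

# Pattern taken from the URLs listed above; only the year changes.
INCLUDE_URL = (
    "https://www.ecb.europa.eu/press/govcdec/mopo/{year}/html/index_include.en.html"
)


def build_include_urls(first_year, last_year):
    """Return the per-year include URLs, newest year first."""
    return [INCLUDE_URL.format(year=y) for y in range(last_year, first_year - 1, -1)]


class LinkCollector(HTMLParser):
    """Collect absolute URLs from every <a href="..."> in a fragment."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url):
    """Parse an HTML fragment and return the links it contains."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


def fetch_links(url, timeout=10):
    """Download one include page and return its links (performs a network call)."""
    with urlopen(url, timeout=timeout) as response:
        html = response.read().decode("utf-8", errors="replace")
    return extract_links(html, url)
```

To use it, loop over `build_include_urls(2012, 2021)` and call `fetch_links` on each URL, then filter the results down to the report links you care about. The fragments may contain links other than the reports, so some filtering is likely needed.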

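If you look at the URLs that are loaded after the initial page load (in Chrome DevTools, under the Network tab), the URLs requested as you scroll down follow a fairly obvious pattern.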
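If you would rather stay with Selenium, one plausible reading of the behaviour above is that each section is only requested once it scrolls into view, so jumping straight to the bottom triggers just the last one. Under that assumption, scrolling one viewport at a time gives every section a chance to load. The function name `scroll_page_in_steps` and its parameters are hypothetical, and the pause length is a guess that may need tuning.

```python
import time


def scroll_page_in_steps(driver, pause_seconds=0.5, max_steps=500):
    """Scroll down one viewport at a time until the bottom of the page is reached.

    `driver` is any object with Selenium's `execute_script` method.
    Returns the number of scroll steps performed.
    """
    steps = 0
    for steps in range(1, max_steps + 1):
        # Move down by exactly one viewport height.
        driver.execute_script("window.scrollBy(0, window.innerHeight);")
        # Give lazily loaded content time to arrive and extend the page.
        time.sleep(pause_seconds)
        at_bottom = driver.execute_script(
            "return window.innerHeight + window.pageYOffset"
            " >= document.body.scrollHeight - 1;"
        )
        if at_bottom:
            break
    return steps
```

In the question's function, this would replace the single `window.scrollTo(0, document.body.scrollHeight)` call, for example `scroll_page_in_steps(driver)` right after `driver.get(url)`. The `max_steps` cap is there so the loop cannot run forever on a page that keeps growing.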

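You can start by searching for the GET request to https://www.ecb.europa.eu/shared/nav/navigation.min.en.json?v=1626262372, then walk up the call stack to find the requests you want, which are probably the ones listed above (I would not recommend this for beginners).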

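There is also another JavaScript request that returns a JSON response which may be useful. Just search the requests under the Network tab, then select the "Preview" sub-tab of any item loaded from the initial request. It looks like a lot, but it is manageable if you work through the responses one at a time.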
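Working through a JSON response one at a time could look roughly like the sketch below. The structure of the actual response is not described here, so this makes no assumption about its keys: it simply walks any nested JSON and keeps the string values that end in `.html`. The function names and the `suffix` parameter are hypothetical.

```python
import json


def iter_strings(node):
    """Yield every string value found anywhere inside a parsed JSON structure."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for value in node.values():
            yield from iter_strings(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_strings(item)


def find_html_paths(payload_text, suffix=".html"):
    """Parse a JSON document and return the string values ending with `suffix`."""
    data = json.loads(payload_text)
    return [s for s in iter_strings(data) if s.endswith(suffix)]
```

You could paste a response body copied from the Network tab into `find_html_paths` to get a quick list of candidate page paths, then decide which of them are worth fetching.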

Answer 1
1 2021-07-22 23:17:30
