How do I scrape a dynamically loading website with scrolling using Python Selenium?

I want to scrape all of the monetary policy reports on this ECB website using Python's Selenium package. Here is my code:

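from typing import List

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait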

CHROME_PATH = <INSERT_CHROME_PATH_HERE>

url = "https://www.ecb.europa.eu/press/govcdec/mopo/html/index.en.html"

# xpath of monetary policy report links
xpath = """//*[@id='snippet*']/dd/div[2]/span/a |
//*[@id='snippet1']/dd/div[2]/span/a |
//*[@id='snippet2']/dd/div[2]/span/a |
//*[@id='snippet3']/dd/div[2]/span/a |
//*[@id='snippet4']/dd/div[2]/span/a |
//*[@id='snippet5']/dd/div[2]/span/a |
//*[@id='snippet6']/dd/div[2]/span/a |
//*[@id='snippet7']/dd/div[2]/span/a |
//*[@id='snippet8']/dd/div[2]/span/a |
//*[@id='snippet9']/dd/div[2]/span/a |
//*[@id='snippet10']/dd/div[2]/span/a |
//*[@id='snippet11']/dd/div[2]/span/a |
//*[@id='snippet12']/dd/div[2]/span/a |
//*[@id='snippet13']/dd/div[2]/span/a |
//*[@id='snippet14']/dd/div[2]/span/a |
//*[@id='snippet15']/dd/div[2]/span/a |
//*[@id='snippet16']/dd/div[2]/span/a |
//*[@id='snippet17']/dd/div[2]/span/a |
//*[@id='snippet18']/dd/div[2]/span/a |
//*[@id='snippet19']/dd/div[2]/span/a |
//*[@id='snippet20']/dd/div[2]/span/a |
//*[@id='snippet21']/dd/div[2]/span/a |
//*[@id='snippet22']/dd/div[2]/span/a 
"""

# css selector of last link on page
wait_until_selector = "#snippet22 > dd:nth-child(2) > div.ecb-langSelector > span > a"


def get_tags_by_xpath_on_page(
    driver: webdriver.Chrome, wait_until_selector: str, xpath: str
) -> List[str]:

    driver.maximize_window()
    driver.get(url)
    driver.execute_script(
        "window.scrollTo(0, document.body.scrollHeight);"
    )  # scroll to bottom
    TIMEOUT = 5
    try:
        element_present = EC.presence_of_element_located(
            (By.CSS_SELECTOR, wait_until_selector)
        )
        WebDriverWait(driver, TIMEOUT).until(element_present)
    except TimeoutException:
        print("Timed out waiting for page to load")
    elems = driver.find_elements(By.XPATH, xpath)
    tags = [elem.get_attribute("href") for elem in elems]
    return tags


with webdriver.Chrome(CHROME_PATH) as driver:
    tags = get_tags_by_xpath_on_page(driver, wait_until_selector, xpath)

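This currently captures only the links to the 1999 monetary policy reports at the very bottom of the page. How can I fix this code to scrape everything?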

I went through the JavaScript, the HTML, and the calls made after the initial page load, and it looks like what you probably want are links like these:

https://www.ecb.europa.eu/press/govcdec/mopo/2019/html/index_include.en.html
https://www.ecb.europa.eu/press/govcdec/mopo/2018/html/index_include.en.html
https://www.ecb.europa.eu/press/govcdec/mopo/2017/html/index_include.en.html

...

https://www.ecb.europa.eu/press/govcdec/mopo/2012/html/index_include.en.html

2020 and 2021 also return results.
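Since the URLs listed above differ only in the year, the list can be generated rather than typed out. The following is a minimal sketch, not a tested scraper: the names `INCLUDE_URL_TEMPLATE` and `build_include_urls` are made up for illustration, and whether the same pattern holds for years outside 2012 to 2021 is an assumption you would need to check.

```python
from typing import List

# Pattern copied from the URLs listed above; only the year changes.
INCLUDE_URL_TEMPLATE = (
    "https://www.ecb.europa.eu/press/govcdec/mopo/{year}/html/index_include.en.html"
)


def build_include_urls(first_year: int, last_year: int) -> List[str]:
    """Return one include URL per year, newest first (hypothetical helper)."""
    return [
        INCLUDE_URL_TEMPLATE.format(year=year)
        for year in range(last_year, first_year - 1, -1)
    ]


if __name__ == "__main__":
    # 2012 through 2021 are the years mentioned in this answer.
    for include_url in build_include_urls(2012, 2021):
        print(include_url)
```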

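If you look at the URLs that are loaded after the initial page load (in Chrome DevTools, under the Network tab), the URLs requested as you scroll down follow a fairly obvious pattern.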
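Once the pattern is known, each per-year fragment can probably be requested directly, without a browser or any scrolling. The sketch below uses only the standard library; `LinkCollector`, `extract_links`, and `fetch_links` are hypothetical names, and the markup of the real fragments has not been verified here, so the code simply collects every `href` it finds.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:  # skip anchors that have no href attribute
            self.links.append(urljoin(self.base_url, href))


def extract_links(html: str, base_url: str) -> List[str]:
    """Return all links in an HTML fragment as absolute URLs."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


def fetch_links(include_url: str, timeout: float = 10.0) -> List[str]:
    """Download one include fragment and return the links it contains."""
    with urlopen(include_url, timeout=timeout) as response:
        html = response.read().decode("utf-8", errors="replace")
    return extract_links(html, include_url)
```

Calling `fetch_links` on each of the per-year URLs should give one list of links per year. This returns every link in the fragment, so you would likely still need to filter for the report links you care about.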

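You could start by searching for the GET request to https://www.ecb.europa.eu/shared/nav/navigation.min.en.json?v=1626262372 and then walk up the call stack to find the requests you want, which are probably the ones listed above (I would not recommend this for beginners).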

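There is also another JavaScript response that returns a JSON response that may be useful. Just search the requests under the Network tab, then open the "Preview" sub-tab for any of the items loaded by the initial request. It looks like a lot, but it is manageable if you work through the responses one at a time.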
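If you save one of those JSON responses, a generic walker can help you see which values look like page links before you write anything more specific. This is only a sketch: the structure of the real response is not known here, so the code makes no assumption about key names, and `iter_strings` and `find_html_paths` are hypothetical helpers.

```python
import json
from typing import Any, Iterator, List


def iter_strings(node: Any) -> Iterator[str]:
    """Yield every string value found anywhere in a parsed JSON document."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for value in node.values():
            yield from iter_strings(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_strings(item)


def find_html_paths(payload: str) -> List[str]:
    """Return string values in a JSON payload that end with '.html'."""
    return [s for s in iter_strings(json.loads(payload)) if s.endswith(".html")]
```

Filtering on the `.html` suffix is just a first guess at what a link looks like; adjust it once you have inspected a real response.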
