
Issues with drop-down button with Selenium

I am having some issues with selecting a drop-down button and then selecting other options to change the web page. I am using Selenium in Python to extract this data. The URL is https://www.transfermarkt.com/premierleague/startseite/wettbewerb/GB1/plus/?saison_id=2019

Code so far:

from selenium import webdriver

driver = webdriver.Chrome('C:/Users/bzholle/chromedriver.exe')
driver.get('https://www.transfermarkt.com/premierleague/startseite/wettbewerb/GB1/plus/?saison_id=2019')

# the consent pop-up lives in an iframe: switch into it, then click accept
driver.switch_to.frame(driver.find_element_by_css_selector('iframe[title="SP Consent Message"]'))
accept_button = driver.find_element_by_xpath("//button[@title='ACCEPT ALL']")
accept_button.click()

driver.find_element_by_id("choosen-country").click()

I keep getting: NoSuchElementException: Message: no such element: Unable to locate element

In the HTML code, the country list does not appear until the drop-down arrow is clicked; however, I cannot click the button. Does anyone have any suggestions?

There are two issues here:

  1. After clicking the accept button, you need to add the line driver.switch_to.default_content() to switch back out of the iframe.
  2. The element you are trying to identify sits inside a shadow root. The only way I know of to locate such an element is a bit hacky: it involves executing JavaScript to get the shadow root, then finding the element within that shadow root. With this code, I am able to click the element:
from selenium import webdriver

driver = webdriver.Chrome('C:/Users/bzholle/chromedriver.exe')
driver.get('https://www.transfermarkt.com/premierleague/startseite/wettbewerb/GB1/plus/?saison_id=2019')

# the consent pop-up lives in an iframe: switch into it, then click accept
driver.switch_to.frame(driver.find_element_by_css_selector('iframe[title="SP Consent Message"]'))
accept_button = driver.find_element_by_xpath("//button[@title='ACCEPT ALL']")
accept_button.click()

# switch back out of the consent iframe
driver.switch_to.default_content()

# run JavaScript to retrieve the shadow root of the quick-select bar
shadow_section = driver.execute_script('''return document.querySelector("tm-quick-select-bar").shadowRoot''')

shadow_section.find_element_by_id("choosen-country").click()
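
As an aside, if you can use Selenium 4 with a recent ChromeDriver, there is a slightly less hacky route: a WebElement exposes a shadow_root property that you can search within directly. A minimal sketch of that variant (assuming Selenium 4; the consent-iframe handling is the same as above):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome('C:/Users/bzholle/chromedriver.exe')
driver.get('https://www.transfermarkt.com/premierleague/startseite/wettbewerb/GB1/plus/?saison_id=2019')

# ... dismiss the consent iframe and switch back, as above ...

# Selenium 4: take the shadow root straight from the host element
host = driver.find_element(By.CSS_SELECTOR, "tm-quick-select-bar")
shadow_root = host.shadow_root  # requires a recent ChromeDriver

# shadow roots generally only support CSS selector lookups
shadow_root.find_element(By.CSS_SELECTOR, "#choosen-country").click()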

You neglected to mention what information you are actually trying to scrape, so the alternative solution I propose below can only help you so much. If you can elaborate and let me know what information you are after, I can tailor my solution.

Logging one's network traffic (while viewing the page in a browser) reveals that multiple XHR (XmlHttpRequest) HTTP GET requests are made to various REST API endpoints, the responses of which are JSON and contain all the information you are likely to want to scrape.

My suggestion would be to simply imitate the HTTP GET requests to the necessary REST API endpoints. No Selenium required:

def get_country_id(country_name):
    import requests

    url = "https://www.transfermarkt.com/quickselect/countries"

    headers = {
        "user-agent": "Mozilla/5.0"
    }

    response = requests.get(url, headers=headers)
    response.raise_for_status()

    # return the id of the first country whose name matches, or None
    return next((country["id"] for country in response.json() if country["name"] == country_name), None)


def get_competitions(country_id):
    import requests

    url = "https://www.transfermarkt.com/quickselect/competitions/{}".format(country_id)

    headers = {
        "user-agent": "Mozilla/5.0"
    }

    response = requests.get(url, headers=headers)
    response.raise_for_status()

    return response.json()

def main():

    country_name = "Iceland"

    country_id = get_country_id(country_name)
    assert country_id is not None

    print("Competitions in {}:".format(country_name))
    for competition in get_competitions(country_id):
        print(competition["name"])
    
    return 0


if __name__ == "__main__":
    import sys
    sys.exit(main())

Output:

Competitions in Iceland:
Pepsi Max deild
Lengjudeild
Mjólkurbikarinn
Lengjubikarinn

EDIT - Unfortunately, the table data you are trying to scrape is not served by the API. It is baked directly into the page's HTML. Still, you do not need Selenium for this - BeautifulSoup is sufficient:

def get_entries():
    import requests
    from bs4 import BeautifulSoup as Soup
    from operator import attrgetter

    url = "https://www.transfermarkt.com/premierleague/startseite/wettbewerb/GB1/plus/"

    params = {
        "saison_id": "2019"
    }

    headers = {
        "user-agent": "Mozilla/5.0"
    }

    response = requests.get(url, params=params, headers=headers)
    response.raise_for_status()

    soup = Soup(response.content, "html.parser")

    table = soup.find("table", {"class": "items"})
    assert table is not None

    # Get text from header cells whose class does not contain the substring "hide"
    fieldnames = list(map(attrgetter("text"), table.select("thead > tr > th:not([class*=\"hide\"])")))
    yield fieldnames

    for row in table.select("tbody > tr"):
        # Assuming the first column will always be an img
        columns = list(map(attrgetter("text"), row.select("td:not([class*=\"hide\"])")[1:]))
        yield dict(zip(fieldnames, columns))

def main():

    from csv import DictWriter

    entries = get_entries()
    fieldnames = next(entries)

    with open("output.csv", "w", newline="") as file:
        writer = DictWriter(file, fieldnames=fieldnames)
        writer.writeheader()
        
        for entry in entries:
            writer.writerow(entry)
    
    return 0


if __name__ == "__main__":
    import sys
    sys.exit(main())

CSV Output:

club,Squad,Total MV,ø MV
Man City,34,€1.27bn,€37.46m
Liverpool,56,€1.09bn,€19.53m
Spurs,36,€1.04bn,€28.94m
Chelsea,36,€797.00m,€22.14m
Man Utd,43,€775.20m,€18.03m
Arsenal,38,€680.55m,€17.91m
Everton,35,€525.50m,€15.01m
Leicester,32,€384.75m,€12.02m
West Ham,38,€371.75m,€9.78m
Wolves,44,€315.40m,€7.17m
Newcastle,41,€312.58m,€7.62m
Bournemouth,39,€311.20m,€7.98m
Watford,43,€270.65m,€6.29m
Southampton,36,€259.80m,€7.22m
Crystal Palace,33,€248.65m,€7.53m
Brighton,45,€225.83m,€5.02m
Burnley,35,€205.58m,€5.87m
Aston Villa,38,€184.60m,€4.86m
Norwich,38,€110.85m,€2.92m
Sheff Utd,34,€110.80m,€3.26m

A real solution would probably involve combining requests to the REST API with scraping the table data via BeautifulSoup - you would iterate over every country, every competition within that country, and every season. The updated code I have posted assumes we are only interested in the competition with ID GB1 (in England), and only in the 2019 season.
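
For illustration, here is a rough sketch of that combined loop, reusing the quickselect endpoints from above. The scrape_competition helper and the generic "x" slug in its URL are assumptions of mine - Transfermarkt appears to route on the competition ID rather than the slug, but verify that before relying on it:

import requests
from bs4 import BeautifulSoup as Soup

HEADERS = {"user-agent": "Mozilla/5.0"}

def get_competitions(country_id):
    # same endpoint as in the snippet above
    url = "https://www.transfermarkt.com/quickselect/competitions/{}".format(country_id)
    response = requests.get(url, headers=HEADERS)
    response.raise_for_status()
    return response.json()

def scrape_competition(competition_id, season):
    # hypothetical helper: fetch one competition's page for one season
    # assumption: the slug segment is ignored as long as the id is correct
    url = "https://www.transfermarkt.com/x/startseite/wettbewerb/{}/plus/".format(competition_id)
    response = requests.get(url, params={"saison_id": str(season)}, headers=HEADERS)
    response.raise_for_status()
    soup = Soup(response.content, "html.parser")
    return soup.find("table", {"class": "items"})

def main():
    response = requests.get("https://www.transfermarkt.com/quickselect/countries", headers=HEADERS)
    response.raise_for_status()

    for country in response.json():
        for competition in get_competitions(country["id"]):
            for season in range(2015, 2020):
                table = scrape_competition(competition["id"], season)
                # ... extract rows from 'table' as in get_entries ...

    return 0


if __name__ == "__main__":
    import sys
    sys.exit(main())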

EDIT - You will have to tweak my solution a bit. I only filter for and keep those columns whose class does not contain the substring "hide", but it turns out that some of those hidden columns are important (the age column, for example). A possible adjustment is sketched below.
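
One way to adjust it (a sketch, under the assumption that the hidden header cells and hidden body cells stay positionally aligned - check this against the live markup) is to drop the "hide" filter in get_entries entirely and keep every column, still discarding the leading image cell:

    # in get_entries: keep all header cells rather than filtering on "hide"
    fieldnames = list(map(attrgetter("text"), table.select("thead > tr > th")))
    yield fieldnames

    for row in table.select("tbody > tr"):
        # still assuming the first column is always an img
        columns = list(map(attrgetter("text"), row.select("td")[1:]))
        yield dict(zip(fieldnames, columns))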
