
Issues with drop-down button with Selenium

I am having trouble clicking a drop-down button so that I can then select additional options that change the web page. I am using Selenium in Python to extract this data. The URL is https://www.transfermarkt.com/premierleague/startseite/wettbewerb/GB1/plus/?saison_id=2019

Code so far:

driver = webdriver.Chrome('C:/Users/bzholle/chromedriver.exe')
driver.get('https://www.transfermarkt.com/premierleague/startseite/wettbewerb/GB1/plus/?saison_id=2019')

#click out of iframe pop-up window
driver.switch_to.frame(driver.find_element_by_css_selector('iframe[title="SP Consent Message"]'))
accept_button = driver.find_element_by_xpath("//button[@title='ACCEPT ALL']")
accept_button.click()

driver.find_element_by_id("choosen-country").click()

I keep getting: NoSuchElementException: Message: no such element: Unable to locate element

In the HTML code, the list of countries does not appear until the drop-down arrow is clicked; however, I cannot get the button to click. Does anyone have any suggestions?

There are two problems here:

  1. After pressing the accept button, you need to add the line driver.switch_to.default_content() to switch back out of the iframe.
  2. The element you are trying to identify is inside a shadow root. The only way I know to identify such an element is somewhat hacky: execute JavaScript to get the shadow root, then find the element within it. The following code works to click that element:
driver = webdriver.Chrome('C:/Users/bzholle/chromedriver.exe')
driver.get('https://www.transfermarkt.com/premierleague/startseite/wettbewerb/GB1/plus/?saison_id=2019')

#click out of iframe pop-up window
driver.switch_to.frame(driver.find_element_by_css_selector('iframe[title="SP Consent Message"]'))
accept_button = driver.find_element_by_xpath("//button[@title='ACCEPT ALL']")
accept_button.click()

driver.switch_to.default_content()

shadow_section = driver.execute_script('''return document.querySelector("tm-quick-select-bar").shadowRoot''')

shadow_section.find_element_by_id("choosen-country").click()
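
Side note: if you can upgrade to Selenium 4, WebElement exposes a shadow_root property, which avoids the execute_script hack. A minimal sketch, assuming a recent selenium/chromedriver combination:

from selenium.webdriver.common.by import By

# Locate the shadow host, then search directly inside its shadow root
host = driver.find_element(By.CSS_SELECTOR, "tm-quick-select-bar")
host.shadow_root.find_element(By.CSS_SELECTOR, "#choosen-country").click()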

You didn't mention what information you're actually trying to scrape, so the following alternative solution can only help you so much. If you elaborate on what you're after, I can tailor my solution.

Logging one's network traffic (while viewing the page in a browser) reveals that multiple XHR (XMLHttpRequest) HTTP GET requests are made to various REST API endpoints, whose JSON responses contain all the information you are likely to want to scrape.

What I'm suggesting is to simply imitate those HTTP GET requests to the necessary REST API endpoints. No Selenium required:

def get_country_id(country_name):
    import requests

    # Endpoint behind the quick-select country drop-down
    url = "https://www.transfermarkt.com/quickselect/countries"

    headers = {
        "user-agent": "Mozilla/5.0"
    }

    response = requests.get(url, headers=headers)
    response.raise_for_status()

    # Return the ID of the first matching country, or None if no match
    return next((country["id"] for country in response.json() if country["name"] == country_name), None)


def get_competitions(country_id):
    import requests

    # Endpoint that lists the competitions for a given country ID
    url = "https://www.transfermarkt.com/quickselect/competitions/{}".format(country_id)

    headers = {
        "user-agent": "Mozilla/5.0"
    }

    response = requests.get(url, headers=headers)
    response.raise_for_status()

    return response.json()

def main():

    country_name = "Iceland"

    country_id = get_country_id(country_name)
    assert country_id is not None

    print("Competitions in {}:".format(country_name))
    for competition in get_competitions(country_id):
        print(competition["name"])
    
    return 0


if __name__ == "__main__":
    import sys
    sys.exit(main())

Output:

Competitions in Iceland:
Pepsi Max deild
Lengjudeild
Mjólkurbikarinn
Lengjubikarinn

EDIT - The table data you're trying to scrape unfortunately does not originate from an API; it's baked directly into the HTML of the page. Still, you don't need Selenium for this; BeautifulSoup is good enough:

def get_entries():
    import requests
    from bs4 import BeautifulSoup as Soup
    from operator import attrgetter

    url = "https://www.transfermarkt.com/premierleague/startseite/wettbewerb/GB1/plus/"

    params = {
        "saison_id": "2019"
    }

    headers = {
        "user-agent": "Mozilla/5.0"
    }

    response = requests.get(url, params=params, headers=headers)
    response.raise_for_status()

    soup = Soup(response.content, "html.parser")

    table = soup.find("table", {"class": "items"})
    assert table is not None

    # Get text from header cells whose class does not contain the substring "hide"
    fieldnames = list(map(attrgetter("text"), table.select("thead > tr > th:not([class*=\"hide\"])")))
    yield fieldnames

    for row in table.select("tbody > tr"):
        # Assuming the first column will always be an img
        columns = list(map(attrgetter("text"), row.select("td:not([class*=\"hide\"])")[1:]))
        yield dict(zip(fieldnames, columns))

def main():

    from csv import DictWriter

    entries = get_entries()
    fieldnames = next(entries)

    with open("output.csv", "w", newline="") as file:
        writer = DictWriter(file, fieldnames=fieldnames)
        writer.writeheader()
        
        for entry in entries:
            writer.writerow(entry)
    
    return 0


if __name__ == "__main__":
    import sys
    sys.exit(main())

CSV Output:

club,Squad,Total MV,ø MV
Man City,34,€1.27bn,€37.46m
Liverpool,56,€1.09bn,€19.53m
Spurs,36,€1.04bn,€28.94m
Chelsea,36,€797.00m,€22.14m
Man Utd,43,€775.20m,€18.03m
Arsenal,38,€680.55m,€17.91m
Everton,35,€525.50m,€15.01m
Leicester,32,€384.75m,€12.02m
West Ham,38,€371.75m,€9.78m
Wolves,44,€315.40m,€7.17m
Newcastle,41,€312.58m,€7.62m
Bournemouth,39,€311.20m,€7.98m
Watford,43,€270.65m,€6.29m
Southampton,36,€259.80m,€7.22m
Crystal Palace,33,€248.65m,€7.53m
Brighton,45,€225.83m,€5.02m
Burnley,35,€205.58m,€5.87m
Aston Villa,38,€184.60m,€4.86m
Norwich,38,€110.85m,€2.92m
Sheff Utd,34,€110.80m,€3.26m

The real solution would probably involve combining requests to the REST APIs with scraping the table data via BeautifulSoup: you would iterate over every country, every competition in that country, and every year. The updated code I've posted assumes we're only interested in the competition with ID GB1 (the English Premier League), and only for 2019.
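
A rough sketch of that combined approach, reusing get_country_id and get_competitions from above. It assumes get_entries has been parameterized to accept a competition ID and a season (a hypothetical signature; the version above hard-codes GB1 and 2019):

def scrape_all(country_names, seasons):
    for country_name in country_names:
        country_id = get_country_id(country_name)
        if country_id is None:
            continue  # unknown country name, skip it
        for competition in get_competitions(country_id):
            for season in seasons:
                # Hypothetical parameterized variant of get_entries
                entries = get_entries(competition["id"], season)
                fieldnames = next(entries)  # the header row comes first
                for entry in entries:
                    ...  # write entry to a per-competition CSV, etc.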

EDIT - You'll have to tweak my solution a bit: I retain only those columns whose class does not contain the substring "hide", but it turns out some of the hidden columns are important (the age column, for example).
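
If you want every column instead, the simplest tweak is to drop the :not([class*="hide"]) filter from both selectors in get_entries (an untested sketch; the header and body cell counts still have to line up for zip to pair them correctly):

    # Keep all header cells:
    fieldnames = list(map(attrgetter("text"), table.select("thead > tr > th")))
    # ...and, per row, keep all body cells (still skipping the leading img column):
    columns = list(map(attrgetter("text"), row.select("td")[1:]))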
