
Multiple scraping: problem in the code. What am I doing wrong?

I am trying to scrape multiple elements with Selenium and combine them into rows that will fit into a database. Until now I have only scraped single elements, never several at once, so there is some problem in my code.

I would like to create this row for each round (round 1, round 2, etc.) of the championship: Round, Date, Team_Home, Team_Away, Result_Home, Result_Away. For context, there will be 8 rows for each round, and there are 26 rounds in total. I'm not getting any errors, but the only output is the >>> prompt, with no text.

PS: the scraping is for personal study only, not for profit.

I would like to get, for example, this:

#SWEDEN ALLSVENSKAN
#Round, Date, Team_Home, Team_Away, Result_Home, Result_Away

Round 1, 11/31/2021 20:45, AIK Stockholm, Malmo, 2, 1
Round 1, 11/31/2021 20:45, Elfsborg, Gothenburg, 2, 3
...and the rest of the other matches of the 1st round

Round 2, 06/12/2021 20:45, Gothenburg, AIK Stockholm, 0, 1
Round 2, 06/12/2021 20:45, Malmo, Elfsborg, 1, 1
...and the rest of the other matches of the 2nd round

Round 3, etc.

Python code for scraping:

Values_Allsvenskan = []

#SCRAPING
driver.get("https://www.diretta.it/calcio/svezia/allsvenskan/risultati/")
driver.implicitly_wait(12)
driver.minimize_window()

for Allsvenskan in multiple_scraping:

    try:
        wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button[id='event__more event__more--static']"))).click()
    except:
        pass

    multiple_scraping = round, date, team_home, team_away, score_home, score_away

    #row/record
    round = driver.find_elements(By.CSS_SELECTOR, "a[href^='/squadra'][class^='event__round event__round--static']")
    date = driver.find_elements(By.CSS_SELECTOR, "a[href^='/squadra'][class^='event__time']") # date and time are a single field on diretta.it
    team_home = driver.find_elements(By.CSS_SELECTOR, "a[href^='/squadra'][class^='event__participant event__participant--home']")            
    team_away = driver.find_elements(By.CSS_SELECTOR, "a[href^='/squadra'][class^='event__participant event__participant--away']")
    score_home = driver.find_elements(By.CSS_SELECTOR, "a[href^='/squadra'][class^='event__score event__score--home']")
    score_away = driver.find_elements(By.CSS_SELECTOR, "a[href^='/squadra'][class^='event__score event__score--away']")   


    Allsvenskan_text = round.text, date.text, team_home.text, team_away.text, score_home.text, score_away.text
    Values_Allsvenskan.append(tuple([Allsvenskan_text]))
    print(Allsvenskan_text)
driver.close


    #INSERT IN DATABASE
    con = sqlite3.connect('/database.db')
    cursor = con.cursor()
    sqlite_insert_query_Allsvenskan = 'INSERT INTO All_Score (round, date, team_home, team_away, score_home, score_away) VALUES (?, ?, ?, ?, ?, ?);'
    cursor.executemany(sqlite_insert_query_Allsvenskan, Values_Allsvenskan)
    con.commit()  

Based on my Python code, can you show me how to fix it? Thanks

You use find_elements to get lists with all round values, all date values, all team_home values, all team_away values, etc., so the values end up in separate lists. You should use zip() to group them into rows like [single round, single date, single team_home, ...].

results = []

for row in zip(date, team_home, team_away, score_home, score_away):
    row = [item.text for item in row]
    print(row)
    results.append(row)

I skipped round because it causes more problems: there is one round header for several matches, so it cannot be zipped with the other lists and needs totally different code.

import selenium
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get("https://www.diretta.it/calcio/svezia/allsvenskan/risultati/")
driver.implicitly_wait(12)
#driver.minimize_window()

wait = WebDriverWait(driver, 10)

try:
    wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button[id='event__more event__more--static']"))).click()
except Exception as ex:
    print('EX:', ex)

round = driver.find_elements(By.CSS_SELECTOR, "[class^='event__round event__round--static']")
date = driver.find_elements(By.CSS_SELECTOR, "[class^='event__time']") # date and time are a single field on diretta.it
team_home = driver.find_elements(By.CSS_SELECTOR, "[class^='event__participant event__participant--home']")            
team_away = driver.find_elements(By.CSS_SELECTOR, "[class^='event__participant event__participant--away']")
score_home = driver.find_elements(By.CSS_SELECTOR, "[class^='event__score event__score--home']")
score_away = driver.find_elements(By.CSS_SELECTOR, "[class^='event__score event__score--away']")   

results = []

for row in zip(date, team_home, team_away, score_home, score_away):
    row = [item.text for item in row]
    print(row)
    results.append(row)

Result:

['01.11. 19:00', 'Degerfors', 'Göteborg', '0', '1']
['01.11. 19:00', 'Halmstad', 'AIK Stockholm', '1', '0']
['01.11. 19:00', 'Mjallby', 'Hammarby', '2', '0']
['31.10. 17:30', 'Örebro', 'Djurgarden', '0', '1']
['31.10. 15:00', 'Norrkoping', 'Elfsborg', '3', '2']
['30.10. 17:30', 'Hacken', 'Kalmar', '1', '4']
['30.10. 15:00', 'Sirius', 'Malmo FF', '2', '3']
['30.10. 15:00', 'Varbergs', 'Östersunds', '3', '0']
['28.10. 19:00', 'Degerfors', 'Elfsborg', '1', '2']
['28.10. 19:00', 'Göteborg', 'Djurgarden', '3', '0']
['28.10. 19:00', 'Halmstad', 'Örebro', '1', '1']
['28.10. 19:00', 'Norrkoping', 'Mjallby', '2', '2']
['27.10. 19:00', 'Kalmar', 'Varbergs', '2', '2']
['27.10. 19:00', 'Malmo FF', 'AIK Stockholm', '1', '0']
['27.10. 19:00', 'Östersunds', 'Hacken', '1', '1']
['27.10. 19:00', 'Sirius', 'Hammarby', '0', '1']
['25.10. 19:00', 'Örebro', 'Degerfors', '1', '2']
['24.10. 17:30', 'AIK Stockholm', 'Norrkoping', '1', '0']
...

But this method can sometimes cause problems: if some row has an empty field, zip() will shift a value from the next row into the current row, and so on, which creates wrong rows.
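
A minimal sketch of that misalignment, using plain Python lists instead of Selenium elements. The sample values are hypothetical: three home teams were found, but only two home scores, because the score for Halmstad is missing.

```python
# Hypothetical data: three matches, but the second one has no score element,
# so the scores list is one item shorter than the teams list.
teams_home = ["Degerfors", "Halmstad", "Mjallby"]
scores_home = ["0", "2"]  # Halmstad's score is missing; "2" belongs to Mjallby

# zip() pairs items by position and stops at the shortest list.
rows = list(zip(teams_home, scores_home))
print(rows)
# [('Degerfors', '0'), ('Halmstad', '2')]
```

Halmstad receives the score that belongs to Mjallby, and Mjallby is dropped without any error, because zip() pairs by position and stops at the shortest list.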

It is better to find all rows (div, or tr in a table), loop over them with a for-loop to handle each row separately, and use row.find_elements instead of driver.find_elements. This also solves the problem with round, whose value has to be read once and then repeated in the following rows.

I search for rows with event__round or event__match and then check which classes each row has. If it has event__round, I read the round. If it has event__match, I use find_element (without the s at the end) to get a single date, single team_home, single team_away, etc., because a single row contains only single values, and combine them with current_round to create the row.

import selenium
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get("https://www.diretta.it/calcio/svezia/allsvenskan/risultati/")
driver.implicitly_wait(12)
#driver.minimize_window()

wait = WebDriverWait(driver, 10)

try:
    wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button[id='event__more event__more--static']"))).click()
except Exception as ex:
    print('EX:', ex)

all_rows = driver.find_elements(By.CSS_SELECTOR, "div[class^='event__round'],div[class^='event__match']")

results = []

current_round = '?'

for row in all_rows:
    classes = row.get_attribute('class')
    #print(classes)
    
    if 'event__round' in classes:
        #round = row.find_elements(By.CSS_SELECTOR, "[class^='event__round event__round--static']")
        current_round = row.text
    else:
        date = row.find_element(By.CSS_SELECTOR, "[class^='event__time']") # date and time are a single field on diretta.it
        team_home = row.find_element(By.CSS_SELECTOR, "[class^='event__participant event__participant--home']")            
        team_away = row.find_element(By.CSS_SELECTOR, "[class^='event__participant event__participant--away']")
        score_home = row.find_element(By.CSS_SELECTOR, "[class^='event__score event__score--home']")
        score_away = row.find_element(By.CSS_SELECTOR, "[class^='event__score event__score--away']")   
    
        row = [current_round, date.text, team_home.text, team_away.text, score_home.text, score_away.text]
        print(row)
        results.append(row)

Result:

['Giornata 26', '01.11. 19:00', 'Degerfors', 'Göteborg', '0', '1']
['Giornata 26', '01.11. 19:00', 'Halmstad', 'AIK Stockholm', '1', '0']
['Giornata 26', '01.11. 19:00', 'Mjallby', 'Hammarby', '2', '0']
['Giornata 26', '31.10. 17:30', 'Örebro', 'Djurgarden', '0', '1']
['Giornata 26', '31.10. 15:00', 'Norrkoping', 'Elfsborg', '3', '2']
['Giornata 26', '30.10. 17:30', 'Hacken', 'Kalmar', '1', '4']
['Giornata 26', '30.10. 15:00', 'Sirius', 'Malmo FF', '2', '3']
['Giornata 26', '30.10. 15:00', 'Varbergs', 'Östersunds', '3', '0']

['Giornata 25', '28.10. 19:00', 'Degerfors', 'Elfsborg', '1', '2']
['Giornata 25', '28.10. 19:00', 'Göteborg', 'Djurgarden', '3', '0']
['Giornata 25', '28.10. 19:00', 'Halmstad', 'Örebro', '1', '1']
# ...
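
Rows in this shape can go straight into executemany() with the INSERT statement from the question. The following is only a sketch: the two sample rows are copied from the output above, the in-memory database stands in for the question's database file, and the CREATE TABLE schema is an assumption, since the question does not show how All_Score is defined.

```python
import sqlite3

# Sample rows copied from the output above; in the real script use `results`.
results = [
    ['Giornata 26', '01.11. 19:00', 'Degerfors', 'Göteborg', '0', '1'],
    ['Giornata 26', '01.11. 19:00', 'Halmstad', 'AIK Stockholm', '1', '0'],
]

# Assumption: in-memory database for illustration only.
con = sqlite3.connect(":memory:")
cursor = con.cursor()

# Assumption: guessed schema, all columns stored as text.
cursor.execute(
    "CREATE TABLE IF NOT EXISTS All_Score ("
    "round TEXT, date TEXT, team_home TEXT, team_away TEXT, "
    "score_home TEXT, score_away TEXT)"
)

insert_query = (
    "INSERT INTO All_Score "
    "(round, date, team_home, team_away, score_home, score_away) "
    "VALUES (?, ?, ?, ?, ?, ?);"
)

# Each row is a sequence of six values, one per placeholder.
cursor.executemany(insert_query, results)
con.commit()

# Read the rows back to check what was stored.
stored = cursor.execute(
    "SELECT round, team_home, score_home FROM All_Score ORDER BY rowid"
).fetchall()
print(stored)
con.close()
```

Note that each row must contain exactly six values. The question's tuple([Allsvenskan_text]) wraps the whole row in a one-element tuple, which would not match the six placeholders.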
