Multiple scraping: problem in the code. What am I doing wrong?

I am trying to scrape multiple elements with Selenium: a multi-element scrape where the elements together make up one row suitable for a database. Until now I have only ever scraped single elements, never several at once, so there are some problems in the code.

I want to create this row for each round of the tournament (Round 1, Round 2, etc.): Round, Date, Team_Home, Team_Away, Result_Home, Result_Away. In detail, just for reference and to give you a better idea, each tournament round will produce 8 rows, and the total number of rounds is 26. I don't get any error, but the output is just >>>. I only get this >>>, with no text and no error.

PS: the scraping is for my personal learning, not for profit.

I would like to get, for example, this:

#SWEDEN ALLSVENKAN
#Round, Date, Team_Home, Team_Away, Result_Home, Result_Away

Round 1, 11/31/2021 20:45, AIK Stockholm, Malmo, 2, 1
Round 1, 11/31/2021 20:45, Elfsborg, Gothenburg, 2, 3
...and the rest of the other matches of the 1st round

Round 2, 06/12/2021 20:45, Gothenburg, AIK Stockholm, 0, 1
Round 2, 06/12/2021 20:45, Malmo, Elfsborg, 1, 1
...and the rest of the other matches of the 2nd round

Round 3, etc.

Python code used for the scraping:

Values_Allsvenskan = []

#SCRAPING
driver.get("https://www.diretta.it/calcio/svezia/allsvenskan/risultati/")
driver.implicitly_wait(12)
driver.minimize_window()

for Allsvenskan in multiple_scraping:

    try:
        wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button[id='event__more event__more--static']"))).click()
    except:
        pass

    multiple_scraping = round, date, team_home, team_away, score_home, score_away

    #row/record
    round = driver.find_elements(By.CSS_SELECTOR, "a[href^='/squadra'][class^='event__round event__round--static']")
    date = driver.find_elements(By.CSS_SELECTOR, "a[href^='/squadra'][class^='event__time']") #data e ora è tutto un pezzo su diretta.it
    team_home = driver.find_elements(By.CSS_SELECTOR, "a[href^='/squadra'][class^='event__participant event__participant--home']")            
    team_away = driver.find_elements(By.CSS_SELECTOR, "a[href^='/squadra'][class^='event__participant event__participant--away']")
    score_home = driver.find_elements(By.CSS_SELECTOR, "a[href^='/squadra'][class^='event__score event__score--home']")
    score_away = driver.find_elements(By.CSS_SELECTOR, "a[href^='/squadra'][class^='event__score event__score--away']")   


    Allsvenskan_text = round.text, date.text, team_home.text, team_away.text, score_home.text, score_away.text
    Values_Allsvenskan.append(tuple([Allsvenskan_text]))
    print(Allsvenskan_text)
driver.close


    #INSERT IN DATABASE
    con = sqlite3.connect('/database.db')
    cursor = con.cursor()
    sqlite_insert_query_Allsvenskan = 'INSERT INTO All_Score (round, date, team_home, team_away, score_home, score_away) VALUES (?, ?, ?, ?, ?, ?);'
    cursor.executemany(sqlite_insert_query_Allsvenskan, Values_Allsvenskan)
    con.commit()  

Based on my Python code, can you tell me how to fix it? Thanks.

You use find_elements to get lists with all rounds, all dates, all team_home, all team_away, etc., so you have the values in separate lists, and you should use zip() to group the values from those lists into rows like [single round, single date, single team_home, ...]

results = []

for row in zip(date, team_home, team_away, score_home, score_away):
    row = [item.text for item in row]
    print(row)
    results.append(row)

I skipped round because it creates more problems and needs completely different code.

import selenium
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get("https://www.diretta.it/calcio/svezia/allsvenskan/risultati/")
driver.implicitly_wait(12)
#driver.minimize_window()

wait = WebDriverWait(driver, 10)

try:
    wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button[id='event__more event__more--static']"))).click()
except Exception as ex:
    print('EX:', ex)

round = driver.find_elements(By.CSS_SELECTOR, "[class^='event__round event__round--static']")
date = driver.find_elements(By.CSS_SELECTOR, "[class^='event__time']") #data e ora è tutto un pezzo su diretta.it
team_home = driver.find_elements(By.CSS_SELECTOR, "[class^='event__participant event__participant--home']")            
team_away = driver.find_elements(By.CSS_SELECTOR, "[class^='event__participant event__participant--away']")
score_home = driver.find_elements(By.CSS_SELECTOR, "[class^='event__score event__score--home']")
score_away = driver.find_elements(By.CSS_SELECTOR, "[class^='event__score event__score--away']")   

results = []

for row in zip(date, team_home, team_away, score_home, score_away):
    row = [item.text for item in row]
    print(row)
    results.append(row)

Result:

['01.11. 19:00', 'Degerfors', 'Göteborg', '0', '1']
['01.11. 19:00', 'Halmstad', 'AIK Stockholm', '1', '0']
['01.11. 19:00', 'Mjallby', 'Hammarby', '2', '0']
['31.10. 17:30', 'Örebro', 'Djurgarden', '0', '1']
['31.10. 15:00', 'Norrkoping', 'Elfsborg', '3', '2']
['30.10. 17:30', 'Hacken', 'Kalmar', '1', '4']
['30.10. 15:00', 'Sirius', 'Malmo FF', '2', '3']
['30.10. 15:00', 'Varbergs', 'Östersunds', '3', '0']
['28.10. 19:00', 'Degerfors', 'Elfsborg', '1', '2']
['28.10. 19:00', 'Göteborg', 'Djurgarden', '3', '0']
['28.10. 19:00', 'Halmstad', 'Örebro', '1', '1']
['28.10. 19:00', 'Norrkoping', 'Mjallby', '2', '2']
['27.10. 19:00', 'Kalmar', 'Varbergs', '2', '2']
['27.10. 19:00', 'Malmo FF', 'AIK Stockholm', '1', '0']
['27.10. 19:00', 'Östersunds', 'Hacken', '1', '1']
['27.10. 19:00', 'Sirius', 'Hammarby', '0', '1']
['25.10. 19:00', 'Örebro', 'Degerfors', '1', '2']
['24.10. 17:30', 'AIK Stockholm', 'Norrkoping', '1', '0']
...

But this method can sometimes create problems: if some row has an empty field, then it moves values from the next row into the current row, and so on, so it can build incorrect rows.
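A minimal, self-contained illustration of that alignment problem (the lists and values below are made up for demonstration, not taken from the page): if one of the zipped lists is shorter because a cell was empty, later values pair up with the wrong match and the last rows are silently dropped.

# Hypothetical data for illustration only: the second list is missing one value,
# as would happen if one match on the page had no score yet.
dates = ['01.11. 19:00', '01.11. 19:00', '31.10. 17:30']
scores_home = ['0', '1']  # one score missing

for row in zip(dates, scores_home):
    print(row)

# Output:
# ('01.11. 19:00', '0')
# ('01.11. 19:00', '1')   <- this score may belong to a different match
# The third date is dropped entirely, so the error is easy to miss.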

It is better to find all the rows (the table's div/tr elements) and then use a for-loop to work on each row separately, using row.find_elements instead of driver.find_elements. This should also solve the problem with round, where the value has to be read once and then repeated for the following rows.

I search for rows using event__round and event__match, and check which classes each row has. If it has event__round, then I get the round. If it has event__match, then I use find_element (without the s at the end) to get a single date, single team_home, single team_away, etc. (because a single row has only single values) and use them together with current_round to build the row.

import selenium
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get("https://www.diretta.it/calcio/svezia/allsvenskan/risultati/")
driver.implicitly_wait(12)
#driver.minimize_window()

wait = WebDriverWait(driver, 10)

try:
    wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button[id='event__more event__more--static']"))).click()
except Exception as ex:
    print('EX:', ex)

all_rows = driver.find_elements(By.CSS_SELECTOR, "div[class^='event__round'],div[class^='event__match']")

results = []

current_round = '?'

for row in all_rows:
    classes = row.get_attribute('class')
    #print(classes)
    
    if 'event__round' in classes:
        #round = row.find_elements(By.CSS_SELECTOR, "[class^='event__round event__round--static']")
        current_round = row.text
    else:
        date = row.find_element(By.CSS_SELECTOR, "[class^='event__time']") #data e ora è tutto un pezzo su diretta.it
        team_home = row.find_element(By.CSS_SELECTOR, "[class^='event__participant event__participant--home']")            
        team_away = row.find_element(By.CSS_SELECTOR, "[class^='event__participant event__participant--away']")
        score_home = row.find_element(By.CSS_SELECTOR, "[class^='event__score event__score--home']")
        score_away = row.find_element(By.CSS_SELECTOR, "[class^='event__score event__score--away']")   
    
        row = [current_round, date.text, team_home.text, team_away.text, score_home.text, score_away.text]
        print(row)
        results.append(row)

Result:

['Giornata 26', '01.11. 19:00', 'Degerfors', 'Göteborg', '0', '1']
['Giornata 26', '01.11. 19:00', 'Halmstad', 'AIK Stockholm', '1', '0']
['Giornata 26', '01.11. 19:00', 'Mjallby', 'Hammarby', '2', '0']
['Giornata 26', '31.10. 17:30', 'Örebro', 'Djurgarden', '0', '1']
['Giornata 26', '31.10. 15:00', 'Norrkoping', 'Elfsborg', '3', '2']
['Giornata 26', '30.10. 17:30', 'Hacken', 'Kalmar', '1', '4']
['Giornata 26', '30.10. 15:00', 'Sirius', 'Malmo FF', '2', '3']
['Giornata 26', '30.10. 15:00', 'Varbergs', 'Östersunds', '3', '0']

['Giornata 25', '28.10. 19:00', 'Degerfors', 'Elfsborg', '1', '2']
['Giornata 25', '28.10. 19:00', 'Göteborg', 'Djurgarden', '3', '0']
['Giornata 25', '28.10. 19:00', 'Halmstad', 'Örebro', '1', '1']
# ...
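
The original question also wanted to write the scraped rows into SQLite. A minimal sketch of that last step, assuming the All_Score table and column names from the question's INSERT statement (the database path and the CREATE TABLE statement here are illustrative assumptions, not part of the original answer):

import sqlite3

# Illustrative path; adjust to your own database file.
con = sqlite3.connect('database.db')
cursor = con.cursor()

# Create the table if it does not exist yet; the column names follow
# the INSERT statement used in the question.
cursor.execute('''CREATE TABLE IF NOT EXISTS All_Score (
    round TEXT, date TEXT, team_home TEXT, team_away TEXT,
    score_home TEXT, score_away TEXT)''')

# Each item in `results` is already a 6-element list
# [round, date, team_home, team_away, score_home, score_away],
# so executemany() can consume it directly.
cursor.executemany(
    'INSERT INTO All_Score (round, date, team_home, team_away, score_home, score_away) '
    'VALUES (?, ?, ?, ?, ?, ?);',
    results
)
con.commit()
con.close()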
