
Multiple scraping: problem in the code. What am I doing wrong?

I am trying to use Selenium to scrape multiple elements: several scraped elements per match that together form a row to be inserted into the database. So far I have only ever scraped single elements, never several at once, so there is some problem in the code.

I would like to create this row for each round (round 1, round 2, etc.) of the championship: Round, Date, Team_Home, Team_Away, Result_Home, Result_Away. In detail, just to give you a better idea, there will be 8 rows for each championship round, and there are 26 rounds in total. I'm not getting any errors, but the output is just >>>, with no text and no error message.

PS: the scraping is for my personal study, not for profit.

I would like to get, for example, this:

#SWEDEN ALLSVENKAN
#Round, Date, Team_Home, Team_Away, Result_Home, Result_Away

Round 1, 11/31/2021 20:45, AIK Stockholm, Malmo, 2, 1
Round 1, 11/31/2021 20:45, Elfsborg, Gothenburg, 2, 3
...and the rest of the other matches of the 1st round

Round 2, 06/12/2021 20:45, Gothenburg, AIK Stockholm, 0, 1
Round 2, 06/12/2021 20:45, Malmo, Elfsborg, 1, 1
...and the rest of the other matches of the 2nd round

Round 3, etc.

Python code for scraping:

Values_Allsvenskan = []

#SCRAPING
driver.get("https://www.diretta.it/calcio/svezia/allsvenskan/risultati/")
driver.implicitly_wait(12)
driver.minimize_window()

for Allsvenskan in multiple_scraping:

    try:
        wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button[id='event__more event__more--static']"))).click()
    except:
        pass

    multiple_scraping = round, date, team_home, team_away, score_home, score_away

    #row/record
    round = driver.find_elements(By.CSS_SELECTOR, "a[href^='/squadra'][class^='event__round event__round--static']")
    date = driver.find_elements(By.CSS_SELECTOR, "a[href^='/squadra'][class^='event__time']") # date and time come as one single piece of text on diretta.it
    team_home = driver.find_elements(By.CSS_SELECTOR, "a[href^='/squadra'][class^='event__participant event__participant--home']")            
    team_away = driver.find_elements(By.CSS_SELECTOR, "a[href^='/squadra'][class^='event__participant event__participant--away']")
    score_home = driver.find_elements(By.CSS_SELECTOR, "a[href^='/squadra'][class^='event__score event__score--home']")
    score_away = driver.find_elements(By.CSS_SELECTOR, "a[href^='/squadra'][class^='event__score event__score--away']")   


    Allsvenskan_text = round.text, date.text, team_home.text, team_away.text, score_home.text, score_away.text
    Values_Allsvenskan.append(tuple([Allsvenskan_text]))
    print(Allsvenskan_text)
driver.close


    #INSERT IN DATABASE
    con = sqlite3.connect('/database.db')
    cursor = con.cursor()
    sqlite_insert_query_Allsvenskan = 'INSERT INTO All_Score (round, date, team_home, team_away, score_home, score_away) VALUES (?, ?, ?, ?, ?, ?);'
    cursor.executemany(sqlite_insert_query_Allsvenskan, Values_Allsvenskan)
    con.commit()  

Based on my Python code, can you show me how I can fix it? Thanks

You use find_elements to get lists with all rounds, all dates, all team_home values, all team_away values, etc., so the values end up in separate lists. You should use zip() to group them into rows like [single round, single date, single team_home, ...].

results = []

for row in zip(date, team_home, team_away, score_home, score_away):
    row = [item.text for item in row]
    print(row)
    results.append(row)

I skipped round because it causes more problems and would need completely different code.

import selenium
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get("https://www.diretta.it/calcio/svezia/allsvenskan/risultati/")
driver.implicitly_wait(12)
#driver.minimize_window()

wait = WebDriverWait(driver, 10)

try:
    wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button[id='event__more event__more--static']"))).click()
except Exception as ex:
    print('EX:', ex)

round = driver.find_elements(By.CSS_SELECTOR, "[class^='event__round event__round--static']")
date = driver.find_elements(By.CSS_SELECTOR, "[class^='event__time']") # date and time come as one single piece of text on diretta.it
team_home = driver.find_elements(By.CSS_SELECTOR, "[class^='event__participant event__participant--home']")            
team_away = driver.find_elements(By.CSS_SELECTOR, "[class^='event__participant event__participant--away']")
score_home = driver.find_elements(By.CSS_SELECTOR, "[class^='event__score event__score--home']")
score_away = driver.find_elements(By.CSS_SELECTOR, "[class^='event__score event__score--away']")   

results = []

for row in zip(date, team_home, team_away, score_home, score_away):
    row = [item.text for item in row]
    print(row)
    results.append(row)

Result:

['01.11. 19:00', 'Degerfors', 'Göteborg', '0', '1']
['01.11. 19:00', 'Halmstad', 'AIK Stockholm', '1', '0']
['01.11. 19:00', 'Mjallby', 'Hammarby', '2', '0']
['31.10. 17:30', 'Örebro', 'Djurgarden', '0', '1']
['31.10. 15:00', 'Norrkoping', 'Elfsborg', '3', '2']
['30.10. 17:30', 'Hacken', 'Kalmar', '1', '4']
['30.10. 15:00', 'Sirius', 'Malmo FF', '2', '3']
['30.10. 15:00', 'Varbergs', 'Östersunds', '3', '0']
['28.10. 19:00', 'Degerfors', 'Elfsborg', '1', '2']
['28.10. 19:00', 'Göteborg', 'Djurgarden', '3', '0']
['28.10. 19:00', 'Halmstad', 'Örebro', '1', '1']
['28.10. 19:00', 'Norrkoping', 'Mjallby', '2', '2']
['27.10. 19:00', 'Kalmar', 'Varbergs', '2', '2']
['27.10. 19:00', 'Malmo FF', 'AIK Stockholm', '1', '0']
['27.10. 19:00', 'Östersunds', 'Hacken', '1', '1']
['27.10. 19:00', 'Sirius', 'Hammarby', '0', '1']
['25.10. 19:00', 'Örebro', 'Degerfors', '1', '2']
['24.10. 17:30', 'AIK Stockholm', 'Norrkoping', '1', '0']
...

But this method may sometimes cause problems: if some row has an empty field, the value from the next row is shifted into the current row, and so on, so it can create wrong rows.
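A minimal illustration with made-up data (not taken from the page) of how a single missing value breaks the alignment:

# Made-up data: suppose the page has no away score yet for the Halmstad
# match, so that value is simply missing from the score_away list.
team_home  = ['Degerfors', 'Halmstad', 'Mjallby']
score_away = ['1', '0']   # '0' really belongs to the Mjallby match

for home, away in zip(team_home, score_away):
    print(home, away)

# Degerfors 1
# Halmstad 0    <- this is Mjallby's score, shifted up; Mjallby gets no row at all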

Better is to find all rows (the div or tr elements in the table) and then use a for-loop to work with every row separately, using row.find_elements instead of driver.find_elements. This also resolves the problem with round, which needs to be read once and then repeated for the following rows.

I search for rows with event__round or event__match and then check which classes the row has. If it has event__round, I read the round. If it has event__match, I use find_element (without the s at the end) to get a single date, single team_home, single team_away, etc. (because a single row contains only single values) and combine them with current_round to create the output row.

import selenium
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get("https://www.diretta.it/calcio/svezia/allsvenskan/risultati/")
driver.implicitly_wait(12)
#driver.minimize_window()

wait = WebDriverWait(driver, 10)

try:
    wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button[id='event__more event__more--static']"))).click()
except Exception as ex:
    print('EX:', ex)

all_rows = driver.find_elements(By.CSS_SELECTOR, "div[class^='event__round'],div[class^='event__match']")

results = []

current_round = '?'

for row in all_rows:
    classes = row.get_attribute('class')
    #print(classes)
    
    if 'event__round' in classes:
        #round = row.find_elements(By.CSS_SELECTOR, "[class^='event__round event__round--static']")
        current_round = row.text
    else:
        date = row.find_element(By.CSS_SELECTOR, "[class^='event__time']") # date and time come as one single piece of text on diretta.it
        team_home = row.find_element(By.CSS_SELECTOR, "[class^='event__participant event__participant--home']")            
        team_away = row.find_element(By.CSS_SELECTOR, "[class^='event__participant event__participant--away']")
        score_home = row.find_element(By.CSS_SELECTOR, "[class^='event__score event__score--home']")
        score_away = row.find_element(By.CSS_SELECTOR, "[class^='event__score event__score--away']")   
    
        row = [current_round, date.text, team_home.text, team_away.text, score_home.text, score_away.text]
        print(row)
        results.append(row)

Result:

['Giornata 26', '01.11. 19:00', 'Degerfors', 'Göteborg', '0', '1']
['Giornata 26', '01.11. 19:00', 'Halmstad', 'AIK Stockholm', '1', '0']
['Giornata 26', '01.11. 19:00', 'Mjallby', 'Hammarby', '2', '0']
['Giornata 26', '31.10. 17:30', 'Örebro', 'Djurgarden', '0', '1']
['Giornata 26', '31.10. 15:00', 'Norrkoping', 'Elfsborg', '3', '2']
['Giornata 26', '30.10. 17:30', 'Hacken', 'Kalmar', '1', '4']
['Giornata 26', '30.10. 15:00', 'Sirius', 'Malmo FF', '2', '3']
['Giornata 26', '30.10. 15:00', 'Varbergs', 'Östersunds', '3', '0']

['Giornata 25', '28.10. 19:00', 'Degerfors', 'Elfsborg', '1', '2']
['Giornata 25', '28.10. 19:00', 'Göteborg', 'Djurgarden', '3', '0']
['Giornata 25', '28.10. 19:00', 'Halmstad', 'Örebro', '1', '1']
# ...
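To store these rows in the database like in your original code, results can be passed straight to executemany(). A minimal sketch, assuming the All_Score table and the /database.db path from your question already exist (every row in results holds exactly the six values the INSERT expects):

import sqlite3

# results comes from the scraping loop above: a list of
# [round, date, team_home, team_away, score_home, score_away] rows
con = sqlite3.connect('/database.db')
cursor = con.cursor()
sqlite_insert_query_Allsvenskan = 'INSERT INTO All_Score (round, date, team_home, team_away, score_home, score_away) VALUES (?, ?, ?, ?, ?, ?);'
cursor.executemany(sqlite_insert_query_Allsvenskan, results)
con.commit()
con.close()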
