Beautifulsoup doesn't scrape data consistently every time
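I'm trying to get the players' names and ratings from this site: https://www.whoscored.com/Matches/1549539/LiveStatistics/England-Premier-League-2021-2022-Brentford-Arsenal .
After scraping, I write the data to a CSV. However, it doesn't scrape consistently. I may have to run the script several times (2-5 times) before it scrapes the data. The same thing happens when I try to scrape other matches. For example, if I fetch data from 3 matches, it may scrape only the first one and none of the data from the remaining pages. Here is my code: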
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
match_link='https://www.whoscored.com/Matches/1549539/Live/England-Premier-League-2021-2022-Brentford-Arsenal.'
driver=webdriver.Chrome('C:\\Program Files (x86)\\chromedriver.exe')
driver.get(match_link)
soup=BeautifulSoup(driver.page_source,'html.parser')
Players_list=[]
Player_rating=[]
try:
    player_name=soup.select('a.player-link span.iconize.iconize-icon-left')
    player_rating=soup.select('td.rating')
    #print('------------getting player name and ratings-----------')
    for nme in player_name:
        #print(nme.text)
        Players_list.append(nme.text)
    for rat in player_rating:
        #print(rat.text)
        Player_rating.append(rat.text)
except:
    print('NO ELEMENT')
Players_list=pd.DataFrame(Players_list)
Player_rating=pd.DataFrame(Player_rating)
df=pd.concat([Players_list,Player_rating],axis=1)
df.to_csv('brentford-arsenal.csv')
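It doesn't raise an error. It just returns an empty result (meaning no data was scraped). The elements are selected correctly; the problem is that the script behaves inconsistently.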
Empty DataFrame
Columns: []
Index: []
You should add a wait so the page can finish rendering.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
url = 'https://www.whoscored.com/Matches/1549539/LiveStatistics/England-Premier-League-2021-2022-Brentford-Arsenal'
driver = webdriver.Chrome('C:\\Program Files (x86)\\chromedriver.exe')
driver.get(url)
# Block for up to 10 seconds until the statistics table body is in the DOM
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, 'player-table-statistics-body'))
)
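The intermittent empty result is consistent with the page source being read before the statistics table has been rendered. An explicit wait addresses this by polling a condition until it holds or a timeout expires. The following is only a rough sketch of that poll-until-ready idea in plain Python, not Selenium's actual implementation; `wait_until`, `FakePage`, and `table_html` are hypothetical names introduced for illustration.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call `condition` repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


class FakePage:
    """Hypothetical stand-in for a browser page whose table appears late."""

    def __init__(self, ready_after_polls):
        self.polls_left = ready_after_polls

    def page_source(self):
        # Pretend the rendered table only shows up after a few polls.
        if self.polls_left > 0:
            self.polls_left -= 1
            return "<html><body></body></html>"
        return '<html><body><tbody id="player-table-statistics-body"></tbody></body></html>'


def table_html(page):
    """Return the page source once it contains the table, else None."""
    source = page.page_source()
    return source if "player-table-statistics-body" in source else None


page = FakePage(ready_after_polls=3)
html = wait_until(lambda: table_html(page), timeout=5.0, poll_interval=0.01)
print("player-table-statistics-body" in html)
```

Reading the source once without waiting corresponds to calling `table_html(page)` a single time: whether it succeeds depends on how far rendering has progressed, which would explain why some runs return data and others do not.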
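Additionally, if you always want to use the latest Chrome driver, the webdriver-manager package automatically detects when a new driver is available and caches it. Install the manager: pip install webdriver-manager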
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
s = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=s)
# web driver goes to page
driver.get(url)
...
Based on your question, a minimal working solution is as follows:
from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd
import time
match_link = 'https://www.whoscored.com/Matches/1549539/LiveStatistics/England-Premier-League-2021-2022-Brentford-Arsenal'
driver = webdriver.Chrome('chromedriver')
driver.maximize_window()
time.sleep(8)
driver.get(match_link)
time.sleep(5)
p=[]
q=[]
soup = BeautifulSoup(driver.page_source, 'html.parser')
divs = soup.select('#statistics-table-home-summary a span.iconize.iconize-icon-left')
for div in divs:
    player = div.text
    p.append(player)
    #print(player)
rs = soup.select('#statistics-table-home-summary td.rating')
for r in rs:
    rating = r.text
    #print(rating)
    q.append(rating)
cols= ['player','rating']
df = pd.DataFrame(data=list(zip(p,q)),columns=cols)
print(df)
# df.to_csv('brentford-arsenal.csv',index=False)
player rating
0 David Raya 7.60
1 Ethan Pinnock 7.57
2 Kristoffer Ajer 6.71
3 Pontus Jansson 6.92
4 Sergi Canós 8.78
5 Rico Henry 6.79
6 Christian Nørgaard 7.67
7 Vitaly Janelt 6.74
8 Frank Onyeka 7.06
9 Bryan Mbeumo 7.15
10 Ivan Toney 7.33
11 Mads Bidstrup 6.21
12 Mads Bech Sørensen 6.11
13 Marcus Forss 5.98
14 Mads Roerslev -
15 Yoane Wissa -
16 Saman Ghoddos -
17 Halil Dervisoglu -
18 Charlie Goode -
19 Patrik Gunnarsson -
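Two details of this output may matter downstream: some ratings are the placeholder `-` rather than a number, and `zip(p, q)` silently truncates to the shorter list, which can hide a partial scrape. The sketch below shows one possible way to handle both using only the standard library; `parse_rating` and `build_rows` are hypothetical helper names, and the sample lists are a few rows copied from the output above.

```python
import csv
import io


def parse_rating(text):
    """Convert a rating cell such as '7.60' to float; '-' or blank becomes None."""
    text = text.strip()
    if text in ("", "-"):
        return None
    return float(text)


def build_rows(players, ratings):
    """Pair names with ratings, refusing to pair lists of different lengths."""
    if len(players) != len(ratings):
        raise ValueError(
            f"got {len(players)} players but {len(ratings)} ratings; "
            "the page may not have finished rendering"
        )
    return [(name.strip(), parse_rating(r)) for name, r in zip(players, ratings)]


# Sample data taken from the scraped output above.
players = ["David Raya", "Sergi Canós", "Mads Roerslev"]
ratings = ["7.60", "8.78", "-"]
rows = build_rows(players, ratings)

# Write to an in-memory buffer here; a real script would open a file instead.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["player", "rating"])
writer.writerows(rows)  # None is written as an empty field
print(buffer.getvalue())
```

Raising on a length mismatch turns an incomplete scrape into a visible error instead of a shorter CSV.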