Using Beautiful Soup and Selenium to Insert Data into CSV
I am using beautifulsoup and selenium to scrape some data in Python. This is the code I am running against the URL https://www.flashscore.co.uk/match/YwbnUyDn/#/match-summary/point-by-point/10:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
DRIVER_PATH = '$PATH/chromedriver.exe'
options = Options()
options.headless = True
options.add_argument("--window-size=1920,1200")
driver = webdriver.Chrome(options=options, executable_path=DRIVER_PATH)
class_name = "matchHistoryRow__dartThrows"
def write_to_output(url):
    driver.get(url)
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    print(soup.find_all("div", {"class": class_name}))
    return
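This is the pattern I am trying to scrape. I want to take the pair of spans on either side of each colon and put them into separate columns of a CSV. The problem is that the class can appear either before or after the colon, so I am not sure how to go about this. For example: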
<div class="matchHistoryRow__dartThrows"><span><span class="matchHistoryRow__dartServis">321</span>:<span>501</span>
<span class="dartType dartType__180" title="180 thrown">180</span></span>, <span><span>321</span>:<span
class="matchHistoryRow__dartServis">361</span><span class="dartType dartType__140"
title="140+ thrown">140+</span></span>, <span><span
class="matchHistoryRow__dartServis">224</span>:<span>361</span></span></div>
I would like it represented in the CSV like this:
player_1_score,player_2_score
321,501
321,361
224,361
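What is the best way to go about this?
You can use a regular expression to parse the scores (the simplest approach, if the text structure allows it):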
import re
import pandas as pd
from bs4 import BeautifulSoup
html_doc = """
<div class="matchHistoryRow__dartThrows"><span><span class="matchHistoryRow__dartServis">321</span>:<span>501</span>
<span class="dartType dartType__180" title="180 thrown">180</span></span>, <span><span>321</span>:<span
class="matchHistoryRow__dartServis">361</span><span class="dartType dartType__140"
title="140+ thrown">140+</span></span>, <span><span
class="matchHistoryRow__dartServis">224</span>:<span>361</span></span></div>
"""
soup = BeautifulSoup(html_doc, "html.parser")
# 1. parse whole text from a row
txt = soup.select_one(".matchHistoryRow__dartThrows").get_text(
    strip=True, separator=" "
)
# 2. find scores with regex
scores = re.findall(r"(\d+)\s+:\s+(\d+)", txt)
# 3. create dataframe from regex
df = pd.DataFrame(scores, columns=["player_1_score", "player_2_score"])
print(df)
df.to_csv("data.csv", index=False)
Prints:
  player_1_score player_2_score
0            321            501
1            321            361
2            224            361
This writes data.csv (screenshot from LibreOffice omitted).
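If pandas is not available, the same regex step can feed the standard-library csv module instead. The following is only a minimal sketch: the `txt` string is what step 1 above is expected to yield for the sample row, the helper name `scores_to_csv` is made up for illustration, and the pattern uses `\s*` rather than the answer's `\s+` so that it also tolerates a colon with no surrounding spaces.

```python
import csv
import io
import re

# Text that step 1 (get_text with separator=" ") is expected to produce
# for the sample row; treat this literal as an assumption.
txt = "321 : 501 180 , 321 : 361 140+ , 224 : 361"

# Variation on the answer's regex: \s* instead of \s+ around the colon.
scores = re.findall(r"(\d+)\s*:\s*(\d+)", txt)


def scores_to_csv(rows, fileobj):
    """Write a header plus one CSV line per (player_1, player_2) pair."""
    writer = csv.writer(fileobj, lineterminator="\n")
    writer.writerow(["player_1_score", "player_2_score"])
    writer.writerows(rows)


# Write to an in-memory buffer here; a real file opened with
# open("data.csv", "w", newline="") would work the same way.
buffer = io.StringIO()
scores_to_csv(scores, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

The badge values (180, 140+) are skipped because they are never followed by a colon, which is the same reason the original pattern ignores them.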
Another approach, without using re:
scores = [
    s.get_text(strip=True)
    for s in soup.select(
        ".matchHistoryRow__dartThrows > span > span:nth-of-type(1), .matchHistoryRow__dartThrows > span > span:nth-of-type(2)"
    )
]
df = pd.DataFrame(
    {"player_1_score": scores[::2], "player_2_score": scores[1::2]}
)
print(df)
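A possible variation on the selector approach, offered as a sketch rather than as part of the original answer: walk each outer span and take its first two direct child spans as one pair. This keeps each pair together, so an entry with a missing score would be skipped instead of shifting every later value between columns, as could happen with the `[::2]` slicing. The function name `extract_pairs` is hypothetical.

```python
from bs4 import BeautifulSoup

html_doc = """
<div class="matchHistoryRow__dartThrows"><span><span class="matchHistoryRow__dartServis">321</span>:<span>501</span>
<span class="dartType dartType__180" title="180 thrown">180</span></span>, <span><span>321</span>:<span
class="matchHistoryRow__dartServis">361</span><span class="dartType dartType__140"
title="140+ thrown">140+</span></span>, <span><span
class="matchHistoryRow__dartServis">224</span>:<span>361</span></span></div>
"""


def extract_pairs(html):
    """Return (player_1_score, player_2_score) tuples for every entry."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    # Each direct child <span> of a row wraps one "score:score" entry.
    for entry in soup.select(".matchHistoryRow__dartThrows > span"):
        # Direct children only: the first two spans hold the scores,
        # an optional third one is the dartType badge (180, 140+).
        children = entry.find_all("span", recursive=False)
        if len(children) >= 2:
            pairs.append(
                (
                    children[0].get_text(strip=True),
                    children[1].get_text(strip=True),
                )
            )
    return pairs


pairs = extract_pairs(html_doc)
print(pairs)
```

The resulting list of tuples can be passed to `pd.DataFrame(pairs, columns=["player_1_score", "player_2_score"])` in the same way as the regex result.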
With Selenium and CSS selectors, player_1_score needs span:first-child and player_2_score needs span:nth-child(2). So you can use the following solution:
import pandas as pd
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# `driver` is the webdriver.Chrome instance created in the question's code
driver.get('https://www.flashscore.co.uk/match/YwbnUyDn/#/match-summary/point-by-point/10')
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button#onetrust-accept-btn-handler"))).click()
player_1_scores = [my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div.matchHistoryRow__dartThrows span span:first-child")))[:3]]
player_2_scores = [my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div.matchHistoryRow__dartThrows span span:nth-child(2)")))[:3]]
df = pd.DataFrame(data=list(zip(player_1_scores, player_2_scores)), columns=['player_1_score', 'player_2_score'])
print(df)
Console output:
  player_1_score player_2_score
0            501            321
1            361            321
2            361            181
Writing to CSV:
df = pd.DataFrame(data=list(zip(player_1_scores, player_2_scores)), columns=['player_1_score', 'player_2_score'])
df.to_csv("my_data.csv", index=False)
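One thing worth checking before writing the file: `zip` stops at the shorter of its inputs, so if the two selectors ever return lists of different lengths, rows are dropped without any error. The sketch below illustrates this with made-up lists (the second one is deliberately short, which is a hypothetical situation and not something observed on the page) and shows `itertools.zip_longest` as one way to make the gap visible.

```python
from itertools import zip_longest

# Sample values taken from the console output above; the second list is
# shortened on purpose to simulate a selector that missed one element.
player_1_scores = ["501", "361", "361"]
player_2_scores = ["321", "321"]

# zip() silently truncates to the shorter list.
rows_truncated = list(zip(player_1_scores, player_2_scores))

# zip_longest() keeps every row and pads the missing side.
rows_padded = list(zip_longest(player_1_scores, player_2_scores, fillvalue=""))

if len(player_1_scores) != len(player_2_scores):
    print("warning: score lists differ in length")
print(rows_truncated)
print(rows_padded)
```

Either list of rows can be passed to `pd.DataFrame(..., columns=['player_1_score', 'player_2_score'])` as in the answer.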