Using Beautiful Soup and Selenium to Insert Data into CSV

I'm using BeautifulSoup and Selenium to scrape some data in Python. Here is my code, which I run against the URL https://www.flashscore.co.uk/match/YwbnUyDn/#/match-summary/point-by-point/10 :

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup

DRIVER_PATH = '$PATH/chromedriver.exe'

options = Options()
options.headless = True
options.add_argument("--window-size=1920,1200")

driver = webdriver.Chrome(options=options, executable_path=DRIVER_PATH)

class_name = "matchHistoryRow__dartThrows"

def write_to_output(url):  
    driver.get(url)
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    print(soup.find_all("div", {"class": class_name}))
    return

This is the markup I am trying to scrape. I would like to get the pair of spans on either side of each colon and put them into separate columns in a CSV. The problem is that the class attribute appears on the span either before or after the colon, so I'm not sure how to go about this. For example:

<div class="matchHistoryRow__dartThrows"><span><span class="matchHistoryRow__dartServis">321</span>:<span>501</span>
        <span class="dartType dartType__180" title="180 thrown">180</span></span>, <span><span>321</span>:<span
            class="matchHistoryRow__dartServis">361</span><span class="dartType dartType__140"
            title="140+ thrown">140+</span></span>, <span><span
            class="matchHistoryRow__dartServis">224</span>:<span>361</span></span></div>

I'd like this to be represented this way in a CSV:

player_1_score,player_2_score
321,501
321,361
224,361

What's the best way to go about this?

You can use regex to parse the scores (the easiest method, if the text is structured accordingly):

import re
import pandas as pd
from bs4 import BeautifulSoup


html_doc = """
<div class="matchHistoryRow__dartThrows"><span><span class="matchHistoryRow__dartServis">321</span>:<span>501</span>
        <span class="dartType dartType__180" title="180 thrown">180</span></span>, <span><span>321</span>:<span
            class="matchHistoryRow__dartServis">361</span><span class="dartType dartType__140"
            title="140+ thrown">140+</span></span>, <span><span
            class="matchHistoryRow__dartServis">224</span>:<span>361</span></span></div>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# 1. parse whole text from a row
txt = soup.select_one(".matchHistoryRow__dartThrows").get_text(
    strip=True, separator=" "
)

# 2. find scores with regex
scores = re.findall(r"(\d+)\s+:\s+(\d+)", txt)

# 3. create dataframe from regex
df = pd.DataFrame(scores, columns=["player_1_score", "player_2_score"])
print(df)
df.to_csv("data.csv", index=False)

Prints:

  player_1_score player_2_score
0            321            501
1            321            361
2            224            361

This creates data.csv.


Another method, without using re:

scores = [
    s.get_text(strip=True)
    for s in soup.select(
        ".matchHistoryRow__dartThrows > span > span:nth-of-type(1), .matchHistoryRow__dartThrows > span > span:nth-of-type(2)"
    )
]

df = pd.DataFrame(
    {"player_1_score": scores[::2], "player_2_score": scores[1::2]}
)

print(df)
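
A variation on the selector approach, as a minimal sketch: instead of collecting every score into one flat list and slicing it, walk each throw group (each direct child span of the row) and take its first two child spans as a pair. This keeps the two scores of a throw together, so a group with a missing span cannot shift the columns. The function name extract_score_pairs is hypothetical, and the sketch assumes the markup matches the sample in the question.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question.
html_doc = """
<div class="matchHistoryRow__dartThrows"><span><span class="matchHistoryRow__dartServis">321</span>:<span>501</span>
        <span class="dartType dartType__180" title="180 thrown">180</span></span>, <span><span>321</span>:<span
            class="matchHistoryRow__dartServis">361</span><span class="dartType dartType__140"
            title="140+ thrown">140+</span></span>, <span><span
            class="matchHistoryRow__dartServis">224</span>:<span>361</span></span></div>
"""


def extract_score_pairs(html):
    """Return a list of (player_1_score, player_2_score) string tuples."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    # Each direct child <span> of the row is one throw group.
    for group in soup.select(".matchHistoryRow__dartThrows > span"):
        # Only look at the group's own child spans, in document order.
        spans = group.find_all("span", recursive=False)
        if len(spans) < 2:
            continue  # skip groups that do not contain both scores
        # The first two spans are the scores; a third (dartType) is ignored.
        pairs.append(
            (spans[0].get_text(strip=True), spans[1].get_text(strip=True))
        )
    return pairs


pairs = extract_score_pairs(html_doc)
print(pairs)
```

The resulting list of tuples can be passed straight to pd.DataFrame(pairs, columns=["player_1_score", "player_2_score"]), and the same function should work on driver.page_source as long as the live page uses the same structure.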

Using Selenium, for player_1_score you need span:first-child and for player_2_score you need span:nth-child(2). So you can use the following solution:

import pandas as pd
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# `driver` is the webdriver.Chrome instance created in the question's code
driver.get('https://www.flashscore.co.uk/match/YwbnUyDn/#/match-summary/point-by-point/10')
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button#onetrust-accept-btn-handler"))).click()
player_1_scores = [my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div.matchHistoryRow__dartThrows span span:first-child")))[:3]]
player_2_scores = [my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div.matchHistoryRow__dartThrows span span:nth-child(2)")))[:3]]
df = pd.DataFrame(data=list(zip(player_1_scores, player_2_scores)), columns=['player_1_score', 'player_2_score'])
print(df)

Console output:

  player_1_score player_2_score
0            501            321
1            361            321
2            361            181

To write to a CSV:

df = pd.DataFrame(data=list(zip(player_1_scores, player_2_scores)), columns=['player_1_score', 'player_2_score'])
df.to_csv("my_data.csv", index=False)
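
If pandas is not otherwise needed, the same file can be produced with the standard-library csv module. The following is a sketch only: the helper name scores_to_csv is hypothetical, and the score pairs are hard-coded from the question's expected output rather than scraped.

```python
import csv
import io


def scores_to_csv(pairs, out):
    """Write (player_1_score, player_2_score) pairs to a file-like object."""
    # lineterminator="\n" overrides the csv module's default of "\r\n".
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["player_1_score", "player_2_score"])  # header row
    writer.writerows(pairs)  # one CSV line per score pair


# Demonstration with an in-memory buffer instead of a real file.
buffer = io.StringIO()
scores_to_csv([("321", "501"), ("321", "361"), ("224", "361")], buffer)
print(buffer.getvalue())
```

To write a real file, pass a handle opened with open("my_data.csv", "w", newline="") in place of the buffer; with the Selenium lists above, zip(player_1_scores, player_2_scores) supplies the pairs.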
