
Scraping soccer data from an HTML table

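I need to extract odds data from the HTML table on this website: http://data.nowgoal.com/1x2/Companyhistory.aspx?id=177&company=Pinnacle&matchdate=2020-06-06&ft=1

I want to extract the odds for every match. The problem is that each match takes up 2 rows (opening and closing).

I wrote this code, but it returns an empty dataframe: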

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup as bs
import pandas as pd
import copy
import numpy as np
import time

results = []


d = webdriver.Chrome(executable_path = r'C:\chromedriver.exe')

u = "http://data.nowgoal.com/1x2/Companyhistory.aspx?id=177&company=Pinnacle&matchdate=2020-06-06&ft=1"

d.get(u)
WebDriverWait(d, 20).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "#main > div.pl_right > table")))

soup = bs(d.page_source, 'lxml')
rows = soup.select('#main > div.pl_right > table')

headers = ['Comp', 'Time', 'Match' ,'Odds', 'H','D', 'A', 'Res']
i = 1
for row in rows[1:]:    
    cols = [td.text for td in row.select('td')]

    if (i % 2 == 1):
        record = {'Comp' : cols[0],
                  'Time' : cols[1],
                  'Match' : ' v '.join([cols[2], cols[10]]),
                  'Odds' : 'op',
                  'H' : cols[3],
                  'D' : cols[4],
                  'A' : cols[5],
                  'Res' : cols[11]}
    else:
        record['Odds'] = 'cl'
        record['H'] = cols[0] 
        record['D'] = cols[1] 
        record['A'] = cols[2]
    results.append(copy.deepcopy(record))
    i+=1

df = pd.DataFrame(results, columns = headers)
d.quit()

This script extracts the table and puts the information into a list:

import re
import requests
from bs4 import BeautifulSoup


url = 'http://data.nowgoal.com/1x2/Companyhistory.aspx?id=177&company=Pinnacle&matchdate=2020-06-06&ft=1'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:76.0) Gecko/20100101 Firefox/76.0'}

soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')

all_data = []
for tr in soup.select('.schedule tr[id^="tr"]')[::2]:
    row1 = [td.get_text(strip=True) for td in tr.select('td')]
    row2 = [td.get_text(strip=True) for td in tr.find_next('tr').select('td')]
    # extract the kick-off time (hh,mm,ss) from the <script> tag:
    row1[1] = re.findall(r'\d+,\d+,\d+(?=\))', tr.select('td')[1].script.contents[0])[0]

    row1 = row1[:3] + row1[3:10] + row2 + row1[10:-1]
    all_data.append(row1)

# print on screen:
from pprint import pprint
pprint(all_data, width=250)

Prints:

[['KFAC', '04,00,00', 'Gyeongju Citizen', '1.94', '3.48', '3.57', '47.60%', '26.54%', '25.87%', '92.34%', '1.83', '3.53', '3.94', '50.43%', '26.14%', '23.42%', '92.29%', 'Pyeongtaek Citizen', '0-0'],
 ['KFAC', '05,00,00', 'Paju Citizen FC', '2.87', '3.09', '2.43', '32.16%', '29.87%', '37.98%', '92.30%', '2.89', '3.06', '2.44', '31.96%', '30.18%', '37.85%', '92.36%', 'Gimpo FC', '2-2'],
 ['KFAC', '06,00,00', 'FC Anyang', '1.26', '5.18', '9.07', '72.35%', '17.60%', '10.05%', '91.16%', '1.33', '4.65', '7.67', '68.52%', '19.60%', '11.88%', '91.13%', 'Goyang FC', '2-0'],
 ['KFAC', '06,00,00', 'Jeju United', '1.09', '8.40', '20.12', '84.46%', '10.96%', '4.58%', '92.06%', '1.09', '8.40', '19.82', '84.41%', '10.95%', '4.64%', '92.01%', 'Songwol', '4-0'],
 ['KFAC', '06,00,00', 'Jeonnam Dragons', '1.14', '6.76', '14.22', '80.08%', '13.50%', '6.42%', '91.29%', '1.15', '6.51', '13.68', '79.32%', '14.01%', '6.67%', '91.22%', 'Chungju Citizen', '2-0'],
 ['KFAC', '07,00,00', 'Hwaseong FC', '2.71', '3.14', '2.53', '34.08%', '29.41%', '36.51%', '92.36%', '2.85', '2.98', '2.52', '32.39%', '30.98%', '36.63%', '92.31%', 'Daejeon Korail', '2-2'],
 ['KFAC', '07,00,00', 'Suwon City', '1.13', '7.00', '14.95', '80.84%', '13.05%', '6.11%', '91.35%', '1.13', '7.09', '15.16', '81.04%', '12.92%', '6.04%', '91.58%', 'Hyochang FC', '10-0'],
 ['KOR D1', '07,30,00', 'FC Seoul', '4.24', '3.39', '1.95', '22.60%', '28.26%', '49.14%', '95.82%', '5.03', '3.65', '1.76', '19.10%', '26.32%', '54.58%', '96.07%', 'Jeonbuk Hyundai Motors', '1-4'],
 ['INT CF', '08,00,00', 'Bohemians1905 B', '1.96', '4.27', '3.27', '48.58%', '22.30%', '29.12%', '95.22%', '', '', '', '', '', '', '', 'Slavia Prague B', '0-5'],
 ['INT CF', '08,00,00', 'Sepsi', '2.03', '3.24', '3.21', '44.27%', '27.74%', '28.00%', '89.87%', '1.68', '3.70', '3.98', '53.30%', '24.20%', '22.50%', '89.54%', 'Chindia Targoviste', '2-1'],
 ['KFAC', '08,00,00', 'Gyeongju KHNP', '1.23', '5.26', '10.32', '73.91%', '17.28%', '8.81%', '90.91%', '1.17', '6.07', '13.34', '78.10%', '15.05%', '6.85%', '91.38%', 'SMC Engineering', '4-0'],
 ['VIE U19', '08,00,00', 'Becamex Binh Duong U19', '1.15', '6.08', '10.27', '76.86%', '14.54%', '8.61%', '88.39%', '1.17', '5.78', '9.71', '75.59%', '15.30%', '9.11%', '88.44%', 'Can Tho U19', '6-0'],
 ['VIE U19', '08,00,00', 'Dong Tam Long An U19', '4.22', '3.64', '1.65', '21.20%', '24.58%', '54.22%', '89.46%', '4.48', '3.63', '1.61', '19.93%', '24.60%', '55.47%', '89.29%', 'Sai Gon FC U19', '1-2'],
 ['INT CF', '08,15,00', 'Admira Praha', '2.34', '3.99', '2.37', '38.85%', '22.79%', '38.36%', '90.91%', '2.27', '4.09', '2.41', '40.05%', '22.23%', '37.72%', '90.91%', 'Loko Vltavin', '3-1'],

... and so on.
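If a DataFrame is what you are after, the nested list can be loaded into pandas directly. This is only a sketch: the two sample rows are copied from the printed output above, the field names HWR, DR, AWR and Return are borrowed from the table header, and the op_/cl_ prefixes are my own naming, not anything the site provides.

```python
import pandas as pd

# Two rows copied from the printed output; in practice pass the full all_data list.
all_data = [
    ['KFAC', '04,00,00', 'Gyeongju Citizen', '1.94', '3.48', '3.57', '47.60%', '26.54%', '25.87%', '92.34%',
     '1.83', '3.53', '3.94', '50.43%', '26.14%', '23.42%', '92.29%', 'Pyeongtaek Citizen', '0-0'],
    ['INT CF', '08,00,00', 'Bohemians1905 B', '1.96', '4.27', '3.27', '48.58%', '22.30%', '29.12%', '95.22%',
     '', '', '', '', '', '', '', 'Slavia Prague B', '0-5'],
]

# Seven odds fields per row; op_ = opening row, cl_ = closing row (prefixes are an assumption).
odds_fields = ['H', 'D', 'A', 'HWR', 'DR', 'AWR', 'Return']
columns = (['League', 'Time', 'Home']
           + [f'op_{name}' for name in odds_fields]
           + [f'cl_{name}' for name in odds_fields]
           + ['Away', 'Score'])

matches = pd.DataFrame(all_data, columns=columns)

# Convert the decimal odds to floats; empty strings (no closing odds) become NaN.
for col in ['op_H', 'op_D', 'op_A', 'cl_H', 'cl_D', 'cl_A']:
    matches[col] = pd.to_numeric(matches[col], errors='coerce')
```

Each match ends up on a single row, and matches without closing odds (like the Bohemians1905 B row) keep NaN in the cl_ columns instead of breaking the layout.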

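That is a lot of work when pandas can parse the table for you with .read_html(). Under the hood it uses lxml by default and falls back to BeautifulSoup with html5lib.

Also, I am assuming the opening odds are the first row and the closing odds are the second row of each pair. So it is just a matter of slicing by even/odd index values: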

import pandas as pd
import requests

url = 'http://data.nowgoal.com/1x2/Companyhistory.aspx?id=177&company=Pinnacle&matchdate=2020-06-06&ft=1'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:76.0) Gecko/20100101 Firefox/76.0'}

response = requests.get(url, headers=headers)
df = pd.read_html(response.text, header=0)[0]

evenRows = list(df.index)[::2]
oddRows = list(df.index)[1::2]

open_df = df.take(evenRows)
close_df = df.take(oddRows)

Output:

print (open_df.head(10).to_string())
    League                             Time              Home    HW     D     AW     HWR      DR     AWR  Return                    Away Score
0     KFAC  showtime(2020,06-1,06,04,00,00)  Gyeongju Citizen  1.94  3.48   3.57  47.60%  26.54%  25.87%  92.34%      Pyeongtaek Citizen   0-0
2     KFAC  showtime(2020,06-1,06,05,00,00)   Paju Citizen FC  2.87  3.09   2.43  32.16%  29.87%  37.98%  92.30%                Gimpo FC   2-2
4     KFAC  showtime(2020,06-1,06,06,00,00)         FC Anyang  1.26  5.18   9.07  72.35%  17.60%  10.05%  91.16%               Goyang FC   2-0
6     KFAC  showtime(2020,06-1,06,06,00,00)       Jeju United  1.09  8.40  20.12  84.46%  10.96%   4.58%  92.06%                 Songwol   4-0
8     KFAC  showtime(2020,06-1,06,06,00,00)   Jeonnam Dragons  1.14  6.76  14.22  80.08%  13.50%   6.42%  91.29%         Chungju Citizen   2-0
10    KFAC  showtime(2020,06-1,06,07,00,00)       Hwaseong FC  2.71  3.14   2.53  34.08%  29.41%  36.51%  92.36%          Daejeon Korail   2-2
12    KFAC  showtime(2020,06-1,06,07,00,00)        Suwon City  1.13  7.00  14.95  80.84%  13.05%   6.11%  91.35%             Hyochang FC  10-0
14  KOR D1  showtime(2020,06-1,06,07,30,00)          FC Seoul  4.24  3.39   1.95  22.60%  28.26%  49.14%  95.82%  Jeonbuk Hyundai Motors   1-4
16  INT CF  showtime(2020,06-1,06,08,00,00)   Bohemians1905 B  1.96  4.27   3.27  48.58%  22.30%  29.12%  95.22%         Slavia Prague B   0-5
18  INT CF  showtime(2020,06-1,06,08,00,00)             Sepsi  2.03  3.24   3.21  44.27%  27.74%  28.00%  89.87%      Chindia Targoviste   2-1
....
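The Time column comes through as the raw script call rather than a timestamp. Here is a small sketch for converting it, assuming the "06-1" part is a JavaScript zero-based month expression (so the month really is 06) and ignoring time zones, which the string does not state. The helper name parse_showtime is hypothetical.

```python
import re
from datetime import datetime

# Matches strings such as 'showtime(2020,06-1,06,04,00,00)'.
SHOWTIME_RE = re.compile(r'showtime\((\d+),(\d+)-1,(\d+),(\d+),(\d+),(\d+)\)')

def parse_showtime(text):
    """Return a naive datetime for a 'showtime(Y,M-1,D,h,m,s)' string, or None if it does not match."""
    match = SHOWTIME_RE.fullmatch(text.strip())
    if match is None:
        return None
    year, month, day, hour, minute, second = map(int, match.groups())
    # Assumption: 'M-1' is the JavaScript month offset, so M is already the calendar month.
    return datetime(year, month, day, hour, minute, second)

kickoff = parse_showtime('showtime(2020,06-1,06,04,00,00)')
```

It could then be applied with something like open_df['Time'].map(parse_showtime).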

Or, since it looks like you want the full table with 'op' and 'cl' labels filled in, only a slight modification of the code is needed:

import pandas as pd
import requests

url = 'http://data.nowgoal.com/1x2/Companyhistory.aspx?id=177&company=Pinnacle&matchdate=2020-06-06&ft=1'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:76.0) Gecko/20100101 Firefox/76.0'}

response = requests.get(url, headers=headers)
df = pd.read_html(response.text, header=0)[0]
df = df.drop(['Compare'],axis=1)
df['Odds'] = 'op'
df.loc[1::2,'Odds'] = 'cl'
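If you would rather have one row per match with the opening and closing odds side by side, the labelled frame can be pivoted. This is a rough sketch on a toy frame in the same alternating layout (odds copied from the output earlier in the thread). The match_id column and the op_/cl_ column names are my own additions, and it assumes the rows strictly alternate open, close.

```python
import pandas as pd

# Toy frame mimicking the alternating open/close rows of the scraped table.
df = pd.DataFrame({
    'Home': ['Gyeongju Citizen', 'Gyeongju Citizen', 'Paju Citizen FC', 'Paju Citizen FC'],
    'Away': ['Pyeongtaek Citizen', 'Pyeongtaek Citizen', 'Gimpo FC', 'Gimpo FC'],
    'HW': [1.94, 1.83, 2.87, 2.89],
    'D': [3.48, 3.53, 3.09, 3.06],
    'AW': [3.57, 3.94, 2.43, 2.44],
})
df['Odds'] = 'op'
df.loc[1::2, 'Odds'] = 'cl'

# Rows 0-1 belong to match 0, rows 2-3 to match 1, and so on.
df['match_id'] = df.index // 2

# One row per match, one column per (odds field, op/cl) pair.
wide = df.pivot(index='match_id', columns='Odds', values=['HW', 'D', 'AW'])
wide.columns = [f'{odds}_{field}' for field, odds in wide.columns]

# Take the team names from the opening rows only, then attach the odds.
teams = df.loc[df['Odds'] == 'op', ['match_id', 'Home', 'Away']].set_index('match_id')
result = teams.join(wide)
```

Taking the team names from the opening rows avoids relying on how the closing rows were filled in when the table was parsed.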

I found an error in the bs4 CSS selector for the table. It should select the rows, not the table itself:

soup.select('#main > div.pl_right > table > tbody > tr')

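I went through your code and found that there are many cases/conditions you are not handling.

For example, you have not handled the date <tr> rows.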
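To make both points concrete, here is a small self-contained sketch. The HTML snippet is made up for illustration (the real page's markup may differ); only the idea of keeping rows whose id starts with "tr" comes from the first answer in this thread.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a header row, a date row, and one open/close pair.
html = """
<div id="main"><div class="pl_right">
<table>
  <tr><th>League</th><th>Home</th><th>HW</th></tr>
  <tr class="date"><td colspan="3">2020-06-06</td></tr>
  <tr id="tr1"><td>KFAC</td><td>Gyeongju Citizen</td><td>1.94</td></tr>
  <tr id="tr1_2"><td>1.83</td></tr>
</table>
</div></div>
"""
soup = BeautifulSoup(html, 'html.parser')

# The original selector matches the <table> element itself, not its rows,
# so slicing with [1:] leaves nothing to loop over and the DataFrame stays empty.
tables = soup.select('#main > div.pl_right > table')
skipped = tables[1:]

# Selecting <tr> descendants works whether or not the markup contains <tbody>.
rows = soup.select('#main > div.pl_right > table tr')

# Drop header and date rows by keeping only rows whose id starts with 'tr'.
match_rows = [tr for tr in rows if tr.get('id', '').startswith('tr')]
```

The "> tbody > tr" form should work on Selenium's page_source, because browsers typically insert <tbody> into the DOM, but it matches nothing when the raw HTML has no <tbody> and the parser does not add one. The descendant form "table tr" sidesteps that difference.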
