Scraping HTML table python bs4
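I want to scrape data from two HTML tables on a Transfermarkt player profile page. Here is an example page: https://www.transfermarkt.com/cristiano-ronaldo/profil/spieler/8198
The first is the "Facts and data" table and the second is the "Stats" table. I want to start from the search page and collect the player URLs. Once I have the URLs from every page of the search results, I want to scrape the stats for each player link.
How can I scrape the data from the HTML tables at that link?
Here is my full code: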
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

url_page = "https://www.transfermarkt.com/detailsuche/spielerdetail/suche/27403221"
response = requests.get(url=url_page,
                        headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'})
response.elapsed.seconds
soup = BeautifulSoup(response.content, "html.parser")

# the result list has to exist before the loop appends to it
players_stats = []

for link in soup.find_all('table', class_='items'):
    for link_pag in link.find_all(class_='spielprofil_tooltip'):
        #add page loop
        url_page = "https://www.transfermarkt.com" + link_pag.attrs["href"]
        response_pagina = requests.get(url=url_page,
                                       headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'})
        soup_pagina = BeautifulSoup(response_pagina.content, "html.parser")
        time.sleep(3)

        for n_player in soup_pagina('h1', itemprop="name"):
            name = n_player.text
        for value_player in soup_pagina('span', class_="waehrung"):
            price = value_player.text

        # "Facts and data" table
        # (the row.find('td', [n]) calls are meant to pick the n-th cell; this is the part that does not work)
        data_table = soup_pagina.find('table', class_='auflistung')
        for data in data_table.find_all('tbody'):
            rows = data.find_all('tr')
            for row in rows:
                try:
                    date_of_birth = row.find('td', [1]).text
                except:
                    date_of_birth = ""
                place_of_birth = row.find('td', [2]).text
                age = row.find('td', [3]).text
                height = row.find('td', [4]).text
                citizenship = row.find('td', [5]).text
                position = row.find('td', [6]).text
                foot = row.find('td', [7]).text
                agent = row.find('td', [8]).text
                club = row.find('td', [9]).text
                joined = row.find('td', [10]).text
                contract_expired = row.find('td', [11]).text
                contract_extension = row.find('td', [12]).text

        # "Stats" table (totals row in the footer)
        stats_table = soup_pagina.find('table', class_='items')
        for stats in stats_table.find_all('tfoot'):
            rows_s = stats.find_all('td')
            for row_s in rows_s:
                total = row.find('td', [3]).text
                goal = row.find('td', [4]).text
                assist = row.find('td', [5]).text
                goal_per_min = row.find('td', [6]).text
                total_min = row.find('td', [7]).text

        data_stats = {
            'name': name,
            'price': price,
            'date_of_birth': date_of_birth,
            'place_of_birth': place_of_birth,
            'age': age,
            'height': height,
            'citizenship': citizenship,
            'position': position,
            'foot': foot,
            'agent': agent,
            'club': club,
            'joined': joined,
            'contract_expired': contract_expired,
            'contract_extension': contract_extension,
        }
        players_stats.append(data_stats)

df = pd.DataFrame(players_stats)
print(df.head())
df.to_csv('players.csv', index=False)
You can use this example of how to get the data from a player page and create a DataFrame from it (you will, of course, need to adapt it):
import requests
import pandas as pd
from bs4 import BeautifulSoup


def get_player(url):
    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:87.0) Gecko/20100101 Firefox/87.0"
    }
    soup = BeautifulSoup(
        requests.get(url, headers=headers).content, "html.parser"
    )

    data = {}

    # data:
    for th in soup.select(".auflistung th"):
        data[th.text.split(":")[0].strip()] = th.find_next("td").get_text(
            strip=True
        )

    # stats:
    for tr in soup.select(".items tr.odd, .items tr.even"):
        row = [td.get_text(strip=True) for td in tr.select("td")[1:]]
        data[row[0]] = row[1:]

    return pd.DataFrame([data])


df1 = get_player(
    "https://www.transfermarkt.com/cristiano-ronaldo/profil/spieler/8198"
)
df2 = get_player(
    "https://www.transfermarkt.com/wojciech-szczesny/profil/spieler/44058"
)
df = pd.concat([df1, df2])
print(df)
df.to_csv("data.csv", index=False)
Prints:
   Full name                            Date of birth  Place of birth  Age  Height  Citizenship  Position              Foot   Player agent        Current club  Joined        Contract expires  Outfitter  Social-Media  Serie A                 Champions League     Coppa Italia        Supercoppa Italiana  Name in home country      Date of last contract extension
0  Cristiano Ronaldo dos Santos Aveiro  Feb 5, 1985    Funchal         36   1,87 m  Portugal     attack - Left Winger  both   Gestifute           Juventus FC   Jul 10, 2018  Jun 30, 2022      Nike                     [27, 25, 2, 91, 2.283]  [6, 4, 2, 143, 570]  [3, 2, -, 99, 198]  [1, 1, -, 90, 90]    NaN                       NaN
0  NaN                                  Apr 18, 1990   Warszawa        30   1,96 m  Poland       Goalkeeper            right  ICM Stellar Sports  Juventus FC   Jul 19, 2017  Jun 30, 2024      Nike                     [24, 24, 5, 2.160]      [7, 8, 2, 660]       [0, -, -, -]        [1, -, 1, 90]        Wojciech Tomasz Szczęsny  Feb 11, 2020
and saves data.csv (screenshot from LibreOffice).
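One thing to watch for when adapting this: each competition column holds a Python list, and to_csv writes a list as its string representation, which is awkward to read back later. A minimal sketch of one way around that is to spread each list over numbered keys before building the DataFrame. The helper name flatten_lists and the "_1", "_2" suffixes are assumptions made for illustration, not part of the answer above. The sample values are copied from the printed output.

```python
def flatten_lists(record, sep="_"):
    """Return a copy of `record` with every list value spread over numbered keys."""
    flat = {}
    for key, value in record.items():
        if isinstance(value, (list, tuple)):
            # "Serie A": ["27", "25"] becomes "Serie A_1": "27", "Serie A_2": "25"
            for i, item in enumerate(value, start=1):
                flat[f"{key}{sep}{i}"] = item
        else:
            flat[key] = value
    return flat


# Sample values taken from the printed output (cell texts are strings).
record = {
    "Full name": "Cristiano Ronaldo dos Santos Aveiro",
    "Serie A": ["27", "25", "2", "91", "2.283"],
}
flat = flatten_lists(record)
print(flat)
```

Inside get_player this could be applied as pd.DataFrame([flatten_lists(data)]). Note that the goalkeeper row in the output has four values per competition where the outfield player has five, so the numbered columns will not mean the same thing for every player; the answer does not label the individual values, so check them against the page before naming the columns.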
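The answer covers a single player page, while the question also asks about starting from the search page. A rough sketch of that first step, collecting the profile links, might look like the following. It reuses the table class "items" and the link class "spielprofil_tooltip" from the question's code, which may not match the live site's current markup; the function name extract_player_urls and the hand-written sample HTML are assumptions for illustration only.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://www.transfermarkt.com"


def extract_player_urls(html, base_url=BASE_URL):
    """Return absolute player profile URLs from a search page, without duplicates."""
    soup = BeautifulSoup(html, "html.parser")
    urls = []
    # Selector built from the class names in the question; the real page may differ.
    for a in soup.select("table.items a.spielprofil_tooltip[href]"):
        url = urljoin(base_url, a["href"])
        if url not in urls:
            urls.append(url)
    return urls


# Hypothetical, hand-written markup standing in for a real search results page.
sample_html = """
<table class="items">
  <tr><td><a class="spielprofil_tooltip" href="/cristiano-ronaldo/profil/spieler/8198">Cristiano Ronaldo</a></td></tr>
  <tr><td><a class="spielprofil_tooltip" href="/wojciech-szczesny/profil/spieler/44058">Wojciech Szczesny</a></td></tr>
  <tr><td><a class="spielprofil_tooltip" href="/cristiano-ronaldo/profil/spieler/8198">Cristiano Ronaldo</a></td></tr>
</table>
"""
player_urls = extract_player_urls(sample_html)
print(player_urls)
```

On a real run, the HTML would come from requests.get on the search URL with the same User-Agent header, and the resulting list could be fed to get_player one URL at a time, combining the frames with pd.concat and keeping a time.sleep between requests as the question's code does.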