Scrape HTML tables with Python bs4

Question

I want to scrape data from two HTML tables on a Transfermarkt player profile page. Here is an example page: https://www.transfermarkt.com/cristiano-ronaldo/profil/spieler/8198

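The first is the "Facts and data" table and the second is the "Stats" table. I want to start from the search page and collect the player URLs. Once I have the URLs from every page of the search results, I want to scrape the stats for each player link.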

How can I scrape the data from the HTML tables at those links?

Here is my full code:

import time

import pandas as pd
import requests
from bs4 import BeautifulSoup

HEADERS = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'}

url_page = "https://www.transfermarkt.com/detailsuche/spielerdetail/suche/27403221"

response = requests.get(url=url_page, headers=HEADERS)
soup = BeautifulSoup(response.content, "html.parser")

# The list must exist before the loop appends to it.
players_stats = []


def cell_texts(container, size):
    """Return the text of every <td> in container, padded with "" to at least `size` items."""
    cells = [td.get_text(strip=True) for td in container.find_all('td')] if container else []
    return cells + [""] * (size - len(cells))


for link in soup.find_all('table', class_='items'):
    for link_pag in link.find_all(class_='spielprofil_tooltip'):
        # add page loop
        url_player = "https://www.transfermarkt.com" + link_pag.attrs["href"]
        response_pagina = requests.get(url=url_player, headers=HEADERS)
        soup_pagina = BeautifulSoup(response_pagina.content, "html.parser")
        time.sleep(3)

        # Default to "" so a missing element does not leave the name undefined.
        name = price = ""
        for n_player in soup_pagina('h1', itemprop="name"):
            name = n_player.text
        for value_player in soup_pagina('span', class_="waehrung"):
            price = value_player.text

        # find('td', [1]) does not pick the n-th cell (bs4 treats the second
        # argument as an attribute filter), so collect the cells and slice the list.
        # The positions 1..12 are the ones assumed in the original code.
        data = cell_texts(soup_pagina.find('table', class_='auflistung'), 13)
        (date_of_birth, place_of_birth, age, height, citizenship, position,
         foot, agent, club, joined, contract_expired, contract_extension) = data[1:13]

        # Totals row of the stats table; positions 3..7 as in the original code.
        stats_table = soup_pagina.find('table', class_='items')
        stats = cell_texts(stats_table.find('tfoot') if stats_table else None, 8)
        total, goal, assist, goal_per_min, total_min = stats[3:8]

        data_stats = {
            'name': name,
            'price': price,
            'date_of_birth': date_of_birth,
            'place_of_birth': place_of_birth,
            'age': age,
            'height': height,
            'citizenship': citizenship,
            'position': position,
            'foot': foot,
            'agent': agent,
            'club': club,
            'joined': joined,
            'contract_expired': contract_expired,
            'contract_extension': contract_extension,
            'total': total,
            'goal': goal,
            'assist': assist,
            'goal_per_min': goal_per_min,
            'total_min': total_min,
        }
        players_stats.append(data_stats)

df = pd.DataFrame(players_stats)
print(df.head())
df.to_csv('players.csv', index=False)

Answer 1

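You can use this example as a starting point for getting the data from a player page and building a DataFrame from it (it will need adapting to your needs):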

import requests
import pandas as pd
from bs4 import BeautifulSoup


def get_player(url):
    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:87.0) Gecko/20100101 Firefox/87.0"
    }
    soup = BeautifulSoup(
        requests.get(url, headers=headers).content, "html.parser"
    )

    data = {}
    # data:
    for th in soup.select(".auflistung th"):
        data[th.text.split(":")[0].strip()] = th.find_next("td").get_text(
            strip=True
        )

    # stats:
    for tr in soup.select(".items tr.odd, .items tr.even"):
        row = [td.get_text(strip=True) for td in tr.select("td")[1:]]
        data[row[0]] = row[1:]

    return pd.DataFrame([data])


df1 = get_player(
    "https://www.transfermarkt.com/cristiano-ronaldo/profil/spieler/8198"
)
df2 = get_player(
    "https://www.transfermarkt.com/wojciech-szczesny/profil/spieler/44058"
)

df = pd.concat([df1, df2])

print(df)
df.to_csv("data.csv", index=False)

Prints:

                             Full name Date of birth Place of birth Age  Height Citizenship              Position   Foot        Player agent Current club        Joined Contract expires Outfitter Social-Media                 Serie A     Champions League        Coppa Italia Supercoppa Italiana      Name in home country Date of last contract extension
0  Cristiano Ronaldo dos Santos Aveiro   Feb 5, 1985        Funchal  36  1,87 m    Portugal  attack - Left Winger   both           Gestifute  Juventus FC  Jul 10, 2018     Jun 30, 2022      Nike               [27, 25, 2, 91, 2.283]  [6, 4, 2, 143, 570]  [3, 2, -, 99, 198]   [1, 1, -, 90, 90]                       NaN                             NaN
0                                  NaN  Apr 18, 1990       Warszawa  30  1,96 m      Poland            Goalkeeper  right  ICM Stellar Sports  Juventus FC  Jul 19, 2017     Jun 30, 2024      Nike                   [24, 24, 5, 2.160]       [7, 8, 2, 660]        [0, -, -, -]       [1, -, 1, 90]  Wojciech Tomasz Szczęsny                    Feb 11, 2020
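
Both rows in the output above carry index 0, because each one-row frame keeps its own index when concatenated, and labels that exist for only one player show up as NaN for the other. A minimal sketch of that behaviour, using made-up placeholder values rather than scraped data, with `ignore_index=True` added so the rows are renumbered:

```python
import pandas as pd

# Placeholder rows (not scraped data): each player page yields different labels.
df1 = pd.DataFrame([{"Full name": "Player A", "Age": "36"}])
df2 = pd.DataFrame([{"Age": "30", "Name in home country": "Player B"}])

# ignore_index=True renumbers the rows 0, 1, ... instead of repeating 0.
# Columns are the union of both frames; missing values become NaN.
combined = pd.concat([df1, df2], ignore_index=True)
print(combined)
```

With many players this keeps the row labels unique, which matters if the frame is later indexed with `.loc`.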

It also saves the result to data.csv.
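
The question also asks how to get from a search page to the individual player pages. The following is a rough offline sketch of both steps, run against hand-written HTML fragments so that it needs no network access. The fragments, `extract_profile_urls`, and `parse_facts` are assumptions made for illustration; only the class names `items`, `spielprofil_tooltip`, and `auflistung` come from the code in this thread, and the live Transfermarkt markup may differ.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://www.transfermarkt.com"

# Hand-written stand-in for one search results page (assumed markup).
SEARCH_HTML = """
<table class="items">
  <tr><td><a class="spielprofil_tooltip" href="/cristiano-ronaldo/profil/spieler/8198">A</a></td></tr>
  <tr><td><a class="spielprofil_tooltip" href="/wojciech-szczesny/profil/spieler/44058">B</a></td></tr>
  <tr><td><a class="spielprofil_tooltip" href="/cristiano-ronaldo/profil/spieler/8198">A</a></td></tr>
</table>
"""

# Hand-written stand-in for the "Facts and data" table (assumed markup).
PROFILE_HTML = """
<table class="auflistung">
  <tr><th>Date of birth:</th><td> Feb 5, 1985 </td></tr>
  <tr><th>Height:</th><td>1,87 m</td></tr>
</table>
"""


def extract_profile_urls(html):
    """Return absolute profile URLs from a search page, without duplicates."""
    soup = BeautifulSoup(html, "html.parser")
    urls = []
    for a in soup.select("table.items a.spielprofil_tooltip[href]"):
        url = urljoin(BASE_URL, a["href"])
        if url not in urls:  # the same player can be linked more than once
            urls.append(url)
    return urls


def parse_facts(html):
    """Map each header label to the text of the cell that follows it."""
    soup = BeautifulSoup(html, "html.parser")
    facts = {}
    for th in soup.select(".auflistung th"):
        label = th.get_text(strip=True).rstrip(":")
        facts[label] = th.find_next("td").get_text(strip=True)
    return facts


profile_urls = extract_profile_urls(SEARCH_HTML)
facts = parse_facts(PROFILE_HTML)
print(profile_urls)
print(facts)
```

In a live run, each URL returned by `extract_profile_urls` would be passed to `get_player`, with a pause such as the `time.sleep(3)` from the question between requests, and the resulting frames combined with `pd.concat`.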

Solution 1
0 2021-04-13 17:27:15
