Dynamically extracting text from a web page with Python BeautifulSoup

Question

I am trying to extract the player position from the web pages of many players (Malcolm Brogdon's page is the example here). I can extract Malcolm Brogdon's position with the following code:

player_id = 'malcolm-brogdon-1'

# Import libraries
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as soup
import pandas as pd
import numpy as np

url = "https://www.sports-reference.com/cbb/players/{}.html".format(player_id)
req = Request(url , headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
page_soup = soup(webpage, "html.parser")

pos = page_soup.p.find("strong").next_sibling.strip()
pos

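However, I would like to do this more dynamically (i.e. find "Position:" and then take whatever follows it). Some other players' pages are structured slightly differently, and my current code does not return the position for them (e.g. Cat Barber).

I tried something like page_soup.find("strong", text="Position:"), but that does not seem to work.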

Answer 1

You can select the element that contains the text "Position" and then take its next text sibling:

import requests
from bs4 import BeautifulSoup


url = "https://www.sports-reference.com/cbb/players/anthony-cat-barber-1.html"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

# :-soup-contains() needs soupsieve >= 2.1; older versions spell it :contains()
pos = soup.select_one('strong:-soup-contains("Position")').find_next_sibling(string=True).strip()
print(pos)

Prints:

Guard
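
One plausible reason the exact match in the question fails is whitespace around the label inside the tag. The sketch below illustrates that on a small inline snippet so it can be run without network access. The markup in SAMPLE_HTML is an assumption made up for illustration, not copied from the real page, and :-soup-contains() assumes soupsieve 2.1 or newer.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; the real page may differ.
SAMPLE_HTML = """
<div id="meta">
  <p><strong>Anthony Barber</strong></p>
  <p>
    <strong>
    Position:
    </strong>
    Guard
  </p>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# An exact string match compares the whole string, including the
# surrounding newlines and spaces, so it finds nothing here.
exact = soup.find("strong", string="Position:")

# The substring selector only requires "Position" to appear in the text.
label = soup.select_one('strong:-soup-contains("Position")')

# The value is the next text node after the label tag.
position = label.find_next_sibling(string=True).strip()

print(exact)
print(position)
```

Because the selector matches on a substring, it does not depend on the label being in the first paragraph of the page, which is what the original page_soup.p lookup relied on.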

Edit: another version:

import requests
from bs4 import BeautifulSoup


url = "https://www.sports-reference.com/cbb/players/anthony-cat-barber-1.html"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

pos = (
    # Guard against <strong> tags whose .string is None before testing the text
    soup.find("strong", string=lambda t: t and "Position" in t)
    .find_next_sibling(string=True)
    .strip()
)
print(pos)
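
Since the goal is to run this over many players, it may help to wrap the lookup in a function that returns None instead of raising when a page has no position label. This is a minimal sketch: extract_position is a hypothetical helper name, and the two HTML strings are made-up inputs rather than real pages.

```python
import re
from typing import Optional

from bs4 import BeautifulSoup


def extract_position(html: str) -> Optional[str]:
    """Return the text following a <strong> label containing "Position", or None."""
    soup = BeautifulSoup(html, "html.parser")

    # A compiled regex is matched with search(), so extra whitespace or a
    # trailing colon around the label does not matter.
    label = soup.find("strong", string=re.compile("Position"))
    if label is None:
        return None

    # The value is the next text node that is a sibling of the label.
    value = label.find_next_sibling(string=True)
    text = value.strip() if value is not None else ""
    return text or None


# Made-up inputs for illustration only.
with_position = "<p><strong>Position:</strong> Guard</p>"
without_position = "<p><strong>Height:</strong> 6-2</p>"

found = extract_position(with_position)
missing = extract_position(without_position)
print(found, missing)
```

In a loop over player ids, the HTML passed in would be requests.get(url).content for each formatted URL, and the None results identify the pages that need a closer look.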

Answer 1: accepted, 2020-08-06 04:29:49
