
Dynamically extract text from webpage using Python BeautifulSoup

I'm trying to extract the player position from many players' webpages (here's an example for Malcolm Brogdon). I'm able to extract Malcolm Brogdon's position using the following code:

player_id = 'malcolm-brogdon-1'

# Import libraries
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as soup
import pandas as pd
import numpy as np

url = "https://www.sports-reference.com/cbb/players/{}.html".format(player_id)
req = Request(url , headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
page_soup = soup(webpage, "html.parser")

pos = page_soup.p.find("strong").next_sibling.strip()
pos

However, I want to do this more dynamically: locate "Position:" and then read whatever comes after it. Other players' pages are structured slightly differently, and my current code doesn't return the position for them (e.g., Cat Barber).

I've tried doing something like page_soup.find("strong", text="Position:") but that doesn't seem to work.
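One likely reason the exact-match lookup returns nothing, assuming the label's text node carries surrounding whitespace or newlines in the page source (this has not been checked against the live page): a plain string filter is compared against the tag's full text, so "Position:" is not equal to "\n  Position:\n  ". The markup below is a made-up stand-in for the real page, and a regular expression is used as one possible partial-match filter.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup (an assumption, not copied from the real page):
# the label text is padded with newlines and spaces.
html = "<p><strong>\n  Position:\n  </strong>\n  Guard\n</p>"
page_soup = BeautifulSoup(html, "html.parser")

# Exact string match: compared against the whole text of the tag.
exact = page_soup.find("strong", string="Position:")

# Regex match: succeeds if the pattern is found anywhere in the text.
partial = page_soup.find("strong", string=re.compile("Position"))

print(exact)
print(partial is not None)
```

If the real page has no such padding, the cause lies elsewhere, but a partial match is still the more tolerant choice.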

Malcolm Brogdon's Sports Reference webpage

You can select the element that contains the text "Position:" and then the next text sibling:

import requests
from bs4 import BeautifulSoup


url = "https://www.sports-reference.com/cbb/players/anthony-cat-barber-1.html"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

# ":contains" is deprecated in newer soupsieve releases in favor of ":-soup-contains"
pos = soup.select_one('strong:-soup-contains("Position")').find_next_sibling(string=True).strip()
print(pos)

Prints:

Guard

EDIT: Another version, using a filter function instead of a CSS selector:

import requests
from bs4 import BeautifulSoup


url = "https://www.sports-reference.com/cbb/players/anthony-cat-barber-1.html"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

pos = (
    # Guard against None: a <strong> with nested tags has no single string
    soup.find("strong", string=lambda t: t is not None and "Position" in t)
    .find_next_sibling(string=True)
    .strip()
)
print(pos)
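Since the goal is to run this over many players, the lookup can be wrapped in a helper that returns None instead of raising when a page has no position label. This is only a sketch: `extract_position` is a hypothetical name, and the sample markup is invented to mimic two page layouts rather than taken from the real site.

```python
from bs4 import BeautifulSoup, NavigableString


def extract_position(html):
    """Return the text following the "Position" label, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    # Partial match on the label; skip <strong> tags whose .string is None.
    label = soup.find(
        "strong", string=lambda s: s is not None and "Position" in s
    )
    if label is None:
        return None
    # Walk the following siblings until a non-empty text node appears.
    sibling = label.next_sibling
    while sibling is not None:
        if isinstance(sibling, NavigableString) and sibling.strip():
            return sibling.strip()
        sibling = sibling.next_sibling
    return None


# Hypothetical layouts: label in the first paragraph vs. a later one.
page_a = "<div><p><strong>Position:</strong> Guard</p></div>"
page_b = (
    "<div><p><strong>Nickname:</strong> Cat</p>"
    "<p>\n<strong>\n  Position:\n  </strong>\n  Guard\n</p></div>"
)
page_c = "<div><p><strong>Height:</strong> 6-5</p></div>"

for page in (page_a, page_b, page_c):
    print(extract_position(page))
```

The helper takes HTML text rather than a URL, so the download step (for example `requests.get(url).content`) stays separate and the parsing can be checked without network access.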
