Dynamically extract text from webpage using Python BeautifulSoup

Question

I'm trying to extract the player position from many players' webpages (here's an example for Malcolm Brogdon). I'm able to extract Malcolm Brogdon's position using the following code:

player_id = 'malcolm-brogdon-1'

# Import libraries
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as soup
import pandas as pd
import numpy as np

url = "https://www.sports-reference.com/cbb/players/{}.html".format(player_id)
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
page_soup = soup(webpage, "html.parser")

pos = page_soup.p.find("strong").next_sibling.strip()
pos

However, I want to do this more dynamically, that is, to locate "Position:" and then find what comes after it. Some players' pages are structured slightly differently, and my current code doesn't return the position for them (e.g., Cat Barber).

I've tried doing something like page_soup.find("strong", text="Position:") but that doesn't seem to work.

Answer 1

You can select the element that contains the text "Position:" and then the next text sibling:

import requests
from bs4 import BeautifulSoup


url = "https://www.sports-reference.com/cbb/players/anthony-cat-barber-1.html"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

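# ":-soup-contains" replaces the deprecated ":contains" selector (soupsieve 2.1+),
# and "string=" replaces the deprecated "text=" argument.
pos = soup.select_one('strong:-soup-contains("Position")').find_next_sibling(string=True).strip()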
print(pos)

Prints:

Guard
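As for why the exact-match attempt from the question may come back empty: one plausible reason (an assumption, since the page markup isn't reproduced here) is that the text inside the strong tag carries surrounding whitespace, and a plain string argument only matches the whole text exactly. A minimal sketch on a made-up HTML snippet, using a compiled regex instead:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup (not copied from the real page): the label has
# leading and trailing whitespace inside the <strong> tag.
html = "<p><strong>\n  Position:\n  </strong>\n  Guard\n</p>"
soup = BeautifulSoup(html, "html.parser")

# A plain string must equal the tag's full text, whitespace included.
exact = soup.find("strong", string="Position:")

# A compiled regex is applied with search(), so a partial match is enough.
pattern = soup.find("strong", string=re.compile(r"Position"))

print(exact)                          # None
print(pattern.next_sibling.strip())   # Guard
```

If the real page has no such whitespace, the cause is something else, but the regex form is still a reasonable way to match on part of the label.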

EDIT: Another version:

import requests
from bs4 import BeautifulSoup


url = "https://www.sports-reference.com/cbb/players/anthony-cat-barber-1.html"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

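pos = (
    # t is None for <strong> tags without a single text child, so guard
    # against it before testing for the label.
    soup.find("strong", string=lambda t: t is not None and "Position" in t)
    .find_next_sibling(string=True)
    .strip()
)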
print(pos)
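Since the goal is to run this over many players, it may help to wrap the lookup in a function that returns None instead of raising when a page has no "Position" label. The following is only a sketch: `extract_position` and the sample HTML strings are hypothetical names and markup made up for illustration, not taken from the real site.

```python
from bs4 import BeautifulSoup


def extract_position(html):
    """Return the text that follows the 'Position' label, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    # Match any <strong> whose text contains "Position"; the None check
    # covers <strong> tags that have no single text child.
    label = soup.find(
        "strong", string=lambda t: t is not None and "Position" in t
    )
    if label is None:
        return None
    # The value is assumed to be the next text node after the label.
    value = label.find_next_sibling(string=True)
    return value.strip() if value is not None else None


# Hypothetical page layouts, for illustration only.
simple_page = "<p><strong>Position:</strong> Guard</p>"
nested_page = (
    "<div><p><strong>Nickname:</strong> Cat</p>"
    "<p><strong>\n  Position:\n  </strong>\n  Guard\n</p></div>"
)
no_position_page = "<p><strong>Height:</strong> 6-5</p>"

for page in (simple_page, nested_page, no_position_page):
    print(extract_position(page))
```

With the real site you would pass `requests.get(url).content` for each player URL and skip or log the players for which the function returns None.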

Answer 1: accepted, 2020-08-06 04:29:49