I am identifying the <strong> tag for the headers. However, whenever I try to grab the rest of the information as "info", I only get back <em>Parade </em> rather than everything else in the <p> tag.
Here is my code:
<p><strong>High School Honors: </strong><em>Parade </em>All-American; <em>Chicago Sun-Times </em>Illinois Player of the Year honors; rushed for 2,100 yards and 31 TDs as a senior; led team to 14-0 record and Class 4A State Championship as a junior with 1,820 yards and 26 TDs; also lettered in baseball.</p>
for strong_tag in soup.find_all('strong'):
    headers = strong_tag.text.replace(':', '').replace('\xa0', ' ').strip()
    info = strong_tag.next_sibling
    headerList.append(headers)
    infoList.append(info)
print(headerList)
print(infoList)
I think this is what you are looking for. It finds the parent p element, converts the soup object to a string, removes the strong element, then converts the string back to a soup object.
from bs4 import BeautifulSoup
soup = BeautifulSoup("<p><strong>High School Honors: </strong><em>Parade </em>All-American; <em>Chicago Sun-Times </em>Illinois Player of the Year honors; rushed for 2,100 yards and 31 TDs as a senior; led team to 14-0 record and Class 4A State Championship as a junior with 1,820 yards and 26 TDs; also lettered in baseball.</p>", 'html.parser')
headerList = []
infoList = []
for strong_tag in soup.find_all('strong'):
    parent = strong_tag.find_parent('p')
    content = str(parent).replace(str(strong_tag), '')
    souped_content = BeautifulSoup(content, 'html.parser')
    infoList.append(souped_content)
    headerList.append(strong_tag)
print(headerList)
print(infoList)
This outputs the following:
[<strong>High School Honors: </strong>]
[<p><em>Parade </em>All-American; <em>Chicago Sun-Times </em>Illinois Player of the Year honors; rushed for 2,100 yards and 31 TDs as a senior; led team to 14-0 record and Class 4A State Championship as a junior with 1,820 yards and 26 TDs; also lettered in baseball.</p>]
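If you want the info as a plain string rather than a soup object, you can call get_text() on the re-parsed fragment. A minimal sketch of that step, using a shortened version of the same markup:

```python
from bs4 import BeautifulSoup

html = '<p><strong>High School Honors: </strong><em>Parade </em>All-American; also lettered in baseball.</p>'
soup = BeautifulSoup(html, 'html.parser')

strong_tag = soup.find('strong')
parent = strong_tag.find_parent('p')
# Serialize the parent, cut out the <strong> element, and re-parse the remainder
content = str(parent).replace(str(strong_tag), '')
souped_content = BeautifulSoup(content, 'html.parser')

# get_text() flattens the remaining fragment to plain text
print(souped_content.get_text())  # Parade All-American; also lettered in baseball.
```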
You can also work with the contents, but then you have to iterate over all the NavigableStrings:
info = ''
for text in soup.p.contents[1:]:
    if isinstance(text, NavigableString):
        info += text
    else:
        info += text.get_text()
Example
from bs4 import BeautifulSoup, NavigableString
html='''
<p><strong>High School Honors: </strong><em>Parade </em>All-American; <em>Chicago Sun-Times </em>Illinois Player of the Year honors; rushed for 2,100 yards and 31 TDs as a senior; led team to 14-0 record and Class 4A State Championship as a junior with 1,820 yards and 26 TDs; also lettered in baseball.</p>
'''
soup = BeautifulSoup(html,'html.parser')
headers = soup.p.strong.get_text().replace(':', '').strip()
info = ''
for text in soup.p.contents[1:]:
    if isinstance(text, NavigableString):
        info += text
    else:
        info += text.get_text()
print(headers)
print(info)
Output
High School Honors
Parade All-American; Chicago Sun-Times Illinois Player of the Year honors; rushed for 2,100 yards and 31 TDs as a senior; led team to 14-0 record and Class 4A State Championship as a junior with 1,820 yards and 26 TDs; also lettered in baseball.
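The loop over the contents can also be condensed into a single join. A small sketch of that variant, again on shortened markup:

```python
from bs4 import BeautifulSoup, NavigableString

html = '<p><strong>High School Honors: </strong><em>Parade </em>All-American; also lettered in baseball.</p>'
soup = BeautifulSoup(html, 'html.parser')

# Each item in contents[1:] is either a NavigableString or a Tag;
# a NavigableString is already text, while a Tag needs get_text(),
# so a generator expression can replace the explicit loop
info = ''.join(
    node if isinstance(node, NavigableString) else node.get_text()
    for node in soup.p.contents[1:]
)
print(info)  # Parade All-American; also lettered in baseball.
```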
Use get_text() and split():
headers = soup.p.get_text(strip=True).split(':')[0]
info = soup.p.get_text().split(':')[1].strip()
Example
from bs4 import BeautifulSoup
html='''
<p><strong>High School Honors: </strong><em>Parade </em>All-American; <em>Chicago Sun-Times </em>Illinois Player of the Year honors; rushed for 2,100 yards and 31 TDs as a senior; led team to 14-0 record and Class 4A State Championship as a junior with 1,820 yards and 26 TDs; also lettered in baseball.</p>
'''
soup = BeautifulSoup(html,'lxml')
headers = soup.p.get_text(strip=True).split(':')[0]
info = soup.p.get_text().split(':')[1].strip()
print(headers)
print(info)
Output
High School Honors
Parade All-American; Chicago Sun-Times Illinois Player of the Year honors; rushed for 2,100 yards and 31 TDs as a senior; led team to 14-0 record and Class 4A State Championship as a junior with 1,820 yards and 26 TDs; also lettered in baseball.
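One caveat with split(':')[1]: if the info text itself contains a colon, everything after the second colon is lost. str.partition splits on the first colon only, which avoids that. A minimal sketch with hypothetical markup chosen to show the difference:

```python
from bs4 import BeautifulSoup

# Hypothetical example where the info text contains its own colon
html = '<p><strong>Motto: </strong>Win: then win again.</p>'
soup = BeautifulSoup(html, 'html.parser')

# partition() splits on the first ':' only, so colons inside the info survive
header, _, info = soup.p.get_text().partition(':')
print(header.strip())  # Motto
print(info.strip())    # Win: then win again.
```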