
BeautifulSoup - Get text after <em> tag

I am using the <strong> tag to identify the headers. However, when I try to grab the rest of the paragraph as the "info", I only get back <em>Parade </em> rather than everything else in the <p> tag.

Here is the HTML and my code:

<p><strong>High School Honors: </strong><em>Parade </em>All-American; <em>Chicago Sun-Times </em>Illinois Player of the Year honors; rushed for 2,100 yards and 31 TDs as a senior; led team to 14-0 record and Class 4A State Championship as a junior with 1,820 yards and 26 TDs; also lettered in baseball.</p>

for strong_tag in soup.find_all('strong'):
    headers = strong_tag.text.replace(':', '').replace('\xa0', ' ').strip()

    info = strong_tag.next_sibling

    headerList.append(headers)
    infoList.append(info)

print(headerList)
print(infoList)

I think this is what you are looking for. It finds the parent <p> element, converts it to a string, removes the <strong> element from that string, then parses the result back into a soup object.

from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><strong>High School Honors: </strong><em>Parade </em>All-American; <em>Chicago Sun-Times </em>Illinois Player of the Year honors; rushed for 2,100 yards and 31 TDs as a senior; led team to 14-0 record and Class 4A State Championship as a junior with 1,820 yards and 26 TDs; also lettered in baseball.</p>", 'html.parser')
headerList = []
infoList = []

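for strong_tag in soup.find_all('strong'):
    # Enclosing <p> that holds both the header and the info
    parent = strong_tag.find_parent('p')
    # Drop the <strong> markup from the paragraph's HTML string
    content = str(parent).replace(str(strong_tag), '')
    # Re-parse the remaining HTML so it is a soup object again
    souped_content = BeautifulSoup(content, 'html.parser')
    infoList.append(souped_content)
    headerList.append(strong_tag)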

print(headerList)
print(infoList)

This outputs the following:

[<strong>High School Honors: </strong>]
[<p><em>Parade </em>All-American; <em>Chicago Sun-Times </em>Illinois Player of the Year honors; rushed for 2,100 yards and 31 TDs as a senior; led team to 14-0 record and Class 4A State Championship as a junior with 1,820 yards and 26 TDs; also lettered in baseball.</p>]
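
For context, next_sibling in the question returns only <em>Parade </em> because it yields just the single node immediately after <strong>. A minimal sketch of an alternative that avoids the string round trip is to walk next_siblings and join each node's text. This assumes the header is the <strong> tag and everything after it in the same <p> is the info; the shortened HTML and the helper name text_after are made up for illustration.

```python
from bs4 import BeautifulSoup, NavigableString

# Shortened version of the paragraph from the question (illustrative only).
html = ("<p><strong>High School Honors: </strong><em>Parade </em>All-American; "
        "<em>Chicago Sun-Times </em>Illinois Player of the Year honors.</p>")
soup = BeautifulSoup(html, "html.parser")


def text_after(tag):
    """Hypothetical helper: join the text of every sibling that follows `tag`."""
    parts = []
    for node in tag.next_siblings:
        # NavigableString is a str subclass; Tag nodes expose get_text().
        parts.append(str(node) if isinstance(node, NavigableString) else node.get_text())
    return "".join(parts).strip()


header_list = []
info_list = []
for strong_tag in soup.find_all("strong"):
    header_list.append(strong_tag.get_text().replace(":", "").strip())
    info_list.append(text_after(strong_tag))

print(header_list)
print(info_list)
```

Unlike the string-replace approach, this returns plain text rather than a soup object, so it fits best when the <em> markup itself is not needed.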

EDIT

You can also work with contents, but you have to iterate over all the child nodes and handle NavigableString objects (imported from bs4) and tags separately:

info = ''
for text in soup.p.contents[1:]:
    if isinstance(text, NavigableString):
        info += text
    else:
        info += text.get_text()

Example

from bs4 import BeautifulSoup, NavigableString

html='''
<p><strong>High School Honors: </strong><em>Parade </em>All-American; <em>Chicago Sun-Times </em>Illinois Player of the Year honors; rushed for 2,100 yards and 31 TDs as a senior; led team to 14-0 record and Class 4A State Championship as a junior with 1,820 yards and 26 TDs; also lettered in baseball.</p>
'''
soup = BeautifulSoup(html,'html.parser')

headers = soup.p.strong.get_text().replace(':', '')

info = ''
for text in soup.p.contents[1:]:
    if isinstance(text, NavigableString):
        info += text
    else:
        info += text.get_text()
print(headers)
print(info)

Output

High School Honors 
Parade All-American; Chicago Sun-Times Illinois Player of the Year honors; rushed for 2,100 yards and 31 TDs as a senior; led team to 14-0 record and Class 4A State Championship as a junior with 1,820 yards and 26 TDs; also lettered in baseball.
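
When a page has several such paragraphs, the same loop can run once per <p>. The following is only a sketch, assuming each paragraph of interest contains a <strong> header; the second paragraph in the sample HTML and the name sections are invented for illustration.

```python
from bs4 import BeautifulSoup, NavigableString

# The second paragraph is invented sample data, not from the original page.
html = """
<p><strong>High School Honors: </strong><em>Parade </em>All-American; also lettered in baseball.</p>
<p><strong>Hometown: </strong>Chicago, <em>Illinois</em></p>
"""
soup = BeautifulSoup(html, "html.parser")

sections = {}
for p in soup.find_all("p"):
    header_tag = p.strong
    if header_tag is None:
        continue  # skip paragraphs without a <strong> header
    header = header_tag.get_text().replace(":", "").strip()
    info = ""
    # Walk every direct child of <p> except the <strong> header itself.
    for node in p.contents:
        if node is header_tag:
            continue
        info += node if isinstance(node, NavigableString) else node.get_text()
    sections[header] = info.strip()

print(sections)
```

Skipping the header by identity instead of slicing contents[1:] means the sketch does not depend on <strong> being the very first child node.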

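Use get_text() and split(). Note that split(':')[1] returns only the text between the first and second colon, so this assumes the info contains no further colons: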

headers = soup.p.get_text(strip=True).split(':')[0]
info = soup.p.get_text().split(':')[1].strip()

Example

from bs4 import BeautifulSoup

html='''
<p><strong>High School Honors: </strong><em>Parade </em>All-American; <em>Chicago Sun-Times </em>Illinois Player of the Year honors; rushed for 2,100 yards and 31 TDs as a senior; led team to 14-0 record and Class 4A State Championship as a junior with 1,820 yards and 26 TDs; also lettered in baseball.</p>
'''
soup = BeautifulSoup(html,'lxml')

headers = soup.p.get_text(strip=True).split(':')[0]
info = soup.p.get_text().split(':')[1].strip()

print(headers)
print(info)

Output

High School Honors
Parade All-American; Chicago Sun-Times Illinois Player of the Year honors; rushed for 2,100 yards and 31 TDs as a senior; led team to 14-0 record and Class 4A State Championship as a junior with 1,820 yards and 26 TDs; also lettered in baseball.
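
If the info might itself contain a colon, a possible variation is str.partition, which splits on the first colon only. This is a hedged sketch rather than part of the original answer; the sample HTML with a second colon is invented to show the difference.

```python
from bs4 import BeautifulSoup

# Invented sample: the info part contains a second colon.
html = ("<p><strong>High School Honors: </strong><em>Parade </em>"
        "All-American; record: 14-0 as a junior.</p>")
soup = BeautifulSoup(html, "html.parser")

# partition() splits on the first ':' only, so later colons stay in `info`.
header, _, info = soup.p.get_text().partition(":")
header = header.strip()
info = info.strip()

print(header)
print(info)
```

With split(':')[1] the same input would stop at "record", since index 1 holds only the text up to the next colon.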
