
Beautiful Soup: Separating out span element from p element

I need to pull a span element out of its enclosing p element.

Here is a specific example of one of the p elements I am parsing:

<p id="p-9">
   <span class="inline-l2-heading">H5N1 virus pathogenic phenotypes among 
          inbred mouse strains.
   </span>
   We experimentally inoculated 21 mouse strains with the highly 
   pathogenic H5N1 influenza A virus A/Hong Kong/213/03 (HK213) 
   and monitored the animals for 30 days thereafter for signs of
   morbidity and mortality. The 50% mouse lethal dose (MLD<sub>50</sub>) 
   values varied from 40 50% egg infective doses (EID<sub>50</sub>) 
   for the influenza virus-susceptible strain DBA/2<sub>S</sub> 
   (susceptibility indicated by “S”) to more than 10<sup>6</sup> 
   EID<sub>50</sub> for the influenza virus-resistant strains 
   BALB/c<sub>R</sub> and BALB/cBy<sub>R</sub> 
   (resistance indicated by “R”) (<a class="xref-fig" href="#F1" id="xref-fig-1- 
   1">Fig. 1</a>).
</p>

If the variable paragraph is the bs4.element.Tag for this element and I run

print(paragraph.text)

the result is:

H5N1 virus pathogenic phenotypes among inbred mouse strains.We experimentally
inoculated 21 mouse strains with the highly pathogenic H5N1 influenza A virus
A/Hong Kong/213/03 (HK213) and monitored the animals for 30 days thereafter 
for signs of morbidity and mortality. The 50% mouse lethal dose (MLD50) 
values varied from 40 50% egg infective doses (EID50) for the influenza 
virus-susceptible strain DBA/2S (susceptibility indicated by “S”) to more 
than 106 EID50 for the influenza virus-resistant strains BALB/cR and 
BALB/cByR (resistance indicated by “R”) (Fig. 1).

As you can see, no space is inserted between the text in the span and the text in the rest of the paragraph, so the first two sentences run together:

"H5N1 virus pathogenic phenotypes among inbred mouse strains.We experimentally..."

This is a big deal because I'm going to split the text by sentence later. Most sentence splitters delimit on a period followed by a space, and most of my other sentences are formed properly.

Is there any way to isolate the text in the span from the rest of the text with bs4, and then concatenate the two afterward with the proper spacing?

Another solution:

import re
from bs4 import BeautifulSoup


txt = '''<p id="p-9">
   <span class="inline-l2-heading">H5N1 virus pathogenic phenotypes among
          inbred mouse strains.
   </span>
   We experimentally inoculated 21 mouse strains with the highly
   pathogenic H5N1 influenza A virus A/Hong Kong/213/03 (HK213)
   and monitored the animals for 30 days thereafter for signs of
   morbidity and mortality. The 50% mouse lethal dose (MLD<sub>50</sub>)
   values varied from 40 50% egg infective doses (EID<sub>50</sub>)
   for the influenza virus-susceptible strain DBA/2<sub>S</sub>
   (susceptibility indicated by “S”) to more than 10<sup>6</sup>
   EID<sub>50</sub> for the influenza virus-resistant strains
   BALB/c<sub>R</sub> and BALB/cBy<sub>R</sub>
   (resistance indicated by “R”) (<a class="xref-fig" href="#F1" id="xref-fig-1-
   1">Fig. 1</a>).
</p>'''

soup = BeautifulSoup(txt, 'html.parser')
paragraph = soup.select_one('p')

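# add a (non-breaking) space at the end of each span:
for span in paragraph.select('span'):
    span.append(BeautifulSoup('&nbsp;', 'html.parser'))

# post-process the text: \s also matches the non-breaking space,
# so every run of whitespace collapses to one ordinary space
print(re.sub(r'\s+', ' ', paragraph.text).strip())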

Prints:

H5N1 virus pathogenic phenotypes among inbred mouse strains. We experimentally inoculated 21 mouse strains with the highly pathogenic H5N1 influenza A virus A/Hong Kong/213/03 (HK213) and monitored the animals for 30 days thereafter for signs of morbidity and mortality. The 50% mouse lethal dose (MLD50) values varied from 40 50% egg infective doses (EID50) for the influenza virus-susceptible strain DBA/2S (susceptibility indicated by “S”) to more than 106 EID50 for the influenza virus-resistant strains BALB/cR and BALB/cByR (resistance indicated by “R”) (Fig. 1).

I am assuming you are using get_text() (or the .text property). An alternative in bs4 is strings, a generator that yields every string in the soup. You can then join them together to get properly formatted text:

from bs4 import BeautifulSoup

html_doc = """
<p>
    <span>Some Text.</span>
    Some text and probably other stuff.
</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')

print(" ".join(soup.strings))
print(" ".join(soup.stripped_strings))

Also, I see that your example has a lot of whitespace used for formatting. You can get rid of it by using stripped_strings instead.
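One caveat worth checking before relying on this: joining with a space puts the separator between every string, including the ones split by inline tags such as <sub>. The short sketch below uses a made-up one-line paragraph (not the markup from the question) to show the effect.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened markup used only to illustrate the behaviour.
html_doc = '<p><span>Heading.</span>Dose (MLD<sub>50</sub>) varied.</p>'
soup = BeautifulSoup(html_doc, 'html.parser')

# stripped_strings yields one string per text node, so <sub> splits the sentence.
pieces = list(soup.stripped_strings)
print(pieces)   # ['Heading.', 'Dose (MLD', '50', ') varied.']

# Joining on a space fixes the span boundary but also separates "MLD" and "50".
joined = ' '.join(pieces)
print(joined)   # Heading. Dose (MLD 50 ) varied.
```

If the text contains subscripts, superscripts or links, that extra spacing may or may not matter for your sentence splitter.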

Try:

import re
from bs4 import BeautifulSoup
html = '''
<p id="p-9">
   <span class="inline-l2-heading">H5N1 virus pathogenic phenotypes among 
          inbred mouse strains.
   </span>
   We experimentally inoculated 21 mouse strains with the highly 
   pathogenic H5N1 influenza A virus A/Hong Kong/213/03 (HK213) 
   and monitored the animals for 30 days thereafter for signs of
   morbidity and mortality. The 50% mouse lethal dose (MLD<sub>50</sub>) 
   values varied from 40 50% egg infective doses (EID<sub>50</sub>) 
   for the influenza virus-susceptible strain DBA/2<sub>S</sub> 
   (susceptibility indicated by “S”) to more than 10<sup>6</sup> 
   EID<sub>50</sub> for the influenza virus-resistant strains 
   BALB/c<sub>R</sub> and BALB/cBy<sub>R</sub> 
   (resistance indicated by “R”) (<a class="xref-fig" href="#F1" id="xref-fig-1- 
   1">Fig. 1</a>).
</p>
'''

soup = BeautifulSoup(html, 'lxml')

# select() returns a list with every matching <p>
for p in soup.select('p'):
    # join the text nodes with a space, then collapse whitespace runs
    para = re.sub(r'\s+', ' ', p.get_text(' '))
    print(para.strip())

Prints:

H5N1 virus pathogenic phenotypes among inbred mouse strains. We experimentally inoculated 21 mouse...

and so on.
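If you would rather add the space only at the span boundary and leave inline tags such as <sub> untouched, the span can be pulled out first and the two parts joined afterwards, which is what the question literally asks for. This is a minimal sketch, not a definitive implementation: the helper names normalize and heading_and_body are made up for illustration, the markup is a shortened version of the paragraph in the question, and it assumes the heading span always carries the class inline-l2-heading.

```python
import re
from bs4 import BeautifulSoup

# Shortened version of the paragraph from the question.
html = '''<p id="p-9">
   <span class="inline-l2-heading">H5N1 virus pathogenic phenotypes among
          inbred mouse strains.
   </span>
   We experimentally inoculated 21 mouse strains. The 50% mouse lethal dose
   (MLD<sub>50</sub>) values varied.
</p>'''


def normalize(text):
    """Collapse every run of whitespace into one space and trim the ends."""
    return re.sub(r'\s+', ' ', text).strip()


def heading_and_body(paragraph):
    """Return (heading, body); hypothetical helper, modifies the parsed tree."""
    heading = ''
    span = paragraph.find('span', class_='inline-l2-heading')
    if span is not None:
        # extract() removes the span from the paragraph and returns it
        heading = normalize(span.extract().get_text())
    # what is left in the paragraph is the body; <sub> text stays attached
    body = normalize(paragraph.get_text())
    return heading, body


soup = BeautifulSoup(html, 'html.parser')
heading, body = heading_and_body(soup.find('p'))
print(heading)  # H5N1 virus pathogenic phenotypes among inbred mouse strains.
print(body)     # We experimentally inoculated 21 mouse strains. The 50% mouse lethal dose (MLD50) values varied.
print(f'{heading} {body}'.strip())
```

Note that extract() changes the soup in place, so parse the document again (or work on a copy) if you still need the original paragraph.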


 