简体   繁体   中英

using bs4 to find a html tag (h2) having text

for this part of html code:

html3= """<a name="definition"> </a>
<h2><span class="sectioncount">3.342.2323</span> Content Logical Definition <a title="link to here" class="self-link" href="valueset-investigation"><img src="ta.png"/></a></h2>
<hr/>
<div><p from the following </p><ul><li>Include these codes as defined in http://snomed.info/sct<table><tr><td><b>Code</b></td><td><b>Display</b></td></tr><tr><td>34353553</td><td>Examination / signs</td><td/></tr><tr><td>35453453453</td><td>History/symptoms</td><td/></tr></table></li></ul></div>
<p> </p>"""

I am going to use beautifulsoup to find h2 that its text equals to "Content Logical Definition" and next siblings. But beautifulsoup can not find h2. The following is my code:

soup = BeautifulSoup(html3, "lxml")
f= soup.find("h2", text = "Content Logical Definition").nextsibilings

This is an error:

AttributeError: 'NoneType' object has no attribute 'nextsibilings'

There are several "h2" in the text, but the only character that makes this h2 unique is "Content Logical Definition". After finding this h2, I am going to extract data from the table and list under it.

The main problem is the way you are locating the h2 element to find siblings from. I'd use a function instead checking that Content Logical Definition is inside the text:

soup.find(lambda elm: elm.name == "h2" and "Content Logical Definition" in elm.text)

Also, to get the next siblings you should use the .next_siblings and not nextsibilings .

Demo:

>>> from bs4 import BeautifulSoup
>>> html3= """<a name="definition"> </a>
... <h2><span class="sectioncount">3.342.2323</span> Content Logical Definition <a title="link to here" class="self-link" href="valueset-investigation"><img src="ta.png"/></a></h2>
... <hr/>
... <div><p from the following </p><ul><li>Include these codes as defined in http://snomed.info/sct<table><tr><td><b>Code</b></td><td><b>Display</b></td></tr><tr><td>34353553</td><td>Examination / signs</td><td/></tr><tr><td>35453453453</td><td>History/symptoms</td><td/></tr></table></li></ul></div>
... <p> </p>"""
>>> soup = BeautifulSoup(html3, "lxml")
>>> h2 = soup.find(lambda elm: elm.name == "h2" and "Content Logical Definition" in elm.text)
>>> for sibling in h2.next_siblings:
...     print(sibling)
... 
<hr/>
<div><p following="" from="" the=""></p><ul><li>Include these codes as defined in http://snomed.info/sct<table><tr><td><b>Code</b></td><td><b>Display</b></td></tr><tr><td>34353553</td><td>Examination / signs</td><td></td></tr><tr><td>35453453453</td><td>History/symptoms</td><td></td></tr></table></li></ul></div>
<p> </p>

Though, now knowing the real HTML you are dealing with and how messed up can it be, I think you should be iterating over the siblings, break on the next h2 or if you find a table before that. Actual implementation:

import requests
from bs4 import BeautifulSoup

urls = [
    'https://www.hl7.org/fhir/valueset-activity-reason.html',
    'https://www.hl7.org/fhir/valueset-age-units.html'
]

for url in urls:
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'lxml')

    h2 = soup.find(lambda elm: elm.name == "h2" and "Content Logical Definition" in elm.text)
    table = None
    for sibling in h2.find_next_siblings():
        if sibling.name == "table":
            table = sibling
            break
        if sibling.name == "h2":
            break
    print(table)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM