How do I get only the part of an HTML tree that comes before a certain tag with a certain string, using BeautifulSoup?

I have an HTML tree and only want the part of it that comes before a certain tag containing a certain string. The example contains only one b tag with the string Notes, but there could be several.

<br/>
Hello
<br/>
<b>
 Notes
</b>
<br/>
Hello
<a name="test">
  Hello2
</a>

should become

<br/>
Hello
<br/>

With my code I get the desired output only as a list, not as an HTML tree.

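from bs4 import BeautifulSoup

# book.html contains the example from above
with open('book.html', 'r') as html_file:
    soup = BeautifulSoup(html_file, 'html.parser')

b_tags = soup.find_all('b')
for index, tag in enumerate(b_tags):
    if tag.text.strip() == 'Notes':
        pos = index
# find_all_previous(string=True) returns a list of strings, not a tree
previous_strings = b_tags[pos].find_all_previous(string=True)
print(previous_strings)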

How can I get the same result as HTML rather than as a list?

Solution

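I iterated over everything that follows the desired tag, removed each element, and then removed the tag itself. Because the tree is modified in place, soup is still a BeautifulSoup object and can be printed as HTML.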

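from bs4 import BeautifulSoup

with open('book.html', 'r') as html_file:
    soup = BeautifulSoup(html_file, 'html.parser')

# Find the <b> tag whose text is 'Notes' (the last one, if there are several)
target = None
for tag in soup.find_all('b'):
    if tag.text.strip() == 'Notes':
        target = tag

# find_all_next() without arguments returns only tags and would leave the
# trailing text behind, so walk next_elements to remove text nodes as well.
for element in list(target.next_elements):
    element.extract()
target.extract()  # finally remove the 'Notes' tag itself
print(soup.prettify())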
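
The same idea can be wrapped in a small function and tried on the example from the question without a file. This is only a sketch: the helper name truncate_before and the inline HTML constant are made up for illustration, and it assumes the cut should happen at the first matching tag. If no tag matches, the tree is returned unchanged.

```python
from bs4 import BeautifulSoup

# Inline copy of the example from the question (stands in for book.html)
HTML = """<br/>
Hello
<br/>
<b>
 Notes
</b>
<br/>
Hello
<a name="test">
  Hello2
</a>
"""


def truncate_before(html, tag_name, text):
    """Hypothetical helper: keep only what precedes the first matching tag."""
    soup = BeautifulSoup(html, "html.parser")
    # Assumption: the first <tag_name> whose stripped text equals `text` is the cut point
    target = soup.find(
        lambda tag: tag.name == tag_name and tag.get_text(strip=True) == text
    )
    if target is None:
        return soup  # no match, nothing to cut
    # Copy into a list first, because extract() changes the tree while we walk it
    for element in list(target.next_elements):
        element.extract()
    target.extract()
    return soup


result = truncate_before(HTML, "b", "Notes")
print(result.prettify())
```

The return value is a regular BeautifulSoup object, so it can be searched further with find_all or turned into a string with str(result).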
