How do I get only the part of an HTML tree that is above a certain tag with a certain string in BeautifulSoup?

Question

I have an HTML tree and only want a certain part of it: everything above a certain tag that contains a certain string. The example contains only one b tag with Notes as its string, but there could be several.

<br/>
Hello
<br/>
<b>
 Notes
</b>
<br/>
Hello
<a name="test">
  Hello2
</a>

should become

<br/>
Hello
<br/>

With my code I only get the desired output as a list, not as an HTML tree.

#book.html contains the example from above
openHtml = open('book.html', 'r')
soup = BeautifulSoup(openHtml, 'html.parser')
all=soup.find_all('b')
for i in all:
    if i.text.strip() == 'Notes':
        pos = all.index(i)
soup = soup.find_all("b")[pos].find_all_previous(string=True)
print(soup)

How can I get the same result as HTML rather than as a list?

Answer 1

Solution

I located the desired tag, removed every element that comes after it in the document, and then removed the tag itself.

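from bs4 import BeautifulSoup

# book.html contains the example from the question
with open('book.html', 'r') as html_file:
    soup = BeautifulSoup(html_file, 'html.parser')

# First <b> tag whose stripped text is exactly 'Notes' (raises StopIteration if there is none)
target = next(b for b in soup.find_all('b') if b.get_text(strip=True) == 'Notes')

# Copy next_elements into a list first, because extract() modifies the tree
for node in list(target.next_elements):
    node.extract()  # removes tags and text nodes that follow the target
target.extract()  # finally remove the matched tag itself
print(soup.prettify())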
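
For anyone who wants to try this without a book.html file, here is a minimal self-contained sketch of the same idea. It parses the example from the question out of a string. The helper name keep_before and the HTML constant are my own and only illustrative; they are not part of BeautifulSoup. I am assuming that "above" means everything that precedes the first matching tag in document order.

```python
from bs4 import BeautifulSoup

# The example markup from the question, inlined so the sketch runs on its own.
HTML = """<br/>
Hello
<br/>
<b>
 Notes
</b>
<br/>
Hello
<a name="test">
  Hello2
</a>"""


def keep_before(soup, tag_name, text):
    """Remove the first matching tag and everything after it, in place.

    Hypothetical helper: returns True if a match was found, else False.
    """
    for tag in soup.find_all(tag_name):
        if tag.get_text(strip=True) == text:
            # Materialize the generator first, because extract() mutates the tree.
            for node in list(tag.next_elements):
                node.extract()
            tag.extract()
            return True
    return False


soup = BeautifulSoup(HTML, "html.parser")
found = keep_before(soup, "b", "Notes")
result = str(soup)
print(result)
```

Because the tree is modified in place, soup is still a BeautifulSoup object afterwards, so prettify(), find_all() and so on keep working on the trimmed tree. If there are several matching b tags, this version cuts at the first one; to cut at a different one you would have to pick that tag yourself before removing what follows it.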
