I have a page which contains several repetitions of: <div...><h4>...<p>...
For example:
html = '''
<div class="proletariat">
<h4>sickle</h4>
<p>Ignore this text</p>
</div>
<div class="proletariat">
<h4>hammer</h4>
<p>This is the text we want</p>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html)
If I write print soup.select('div[class^="proletariat"] > h4 ~ p')
, I get:
[<p>Ignore this text</p>, <p>This is the text we want</p>]
How do I specify that I only want the text of p when it is preceded by <h4>hammer</h4>
?
Thanks
html = '''
<div class="proletariat">
<h4>sickle</h4>
<p>Ignore this text</p>
</div>
<div class="proletariat">
<h4>hammer</h4>
<p>This is the text we want</p>
</div>
'''
import re
from bs4 import BeautifulSoup
soup = BeautifulSoup(html)
print(soup.find("h4", text=re.compile('hammer')).next_sibling.next.text)
This is the text we want
:contains()
could help here, but it is not supported.
Taking this into account, you can use select()
in conjunction with the find_next_sibling()
:
print next(h4.find_next_sibling('p').text
for h4 in soup.select('div[class^="proletariat"] > h4')
if h4.text == "hammer")
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.