Using page text to select `html` element using `Beautiful Soup`

Question

I have a page which contains several repetitions of: <div...><h4>...<p>... For example:

html = '''
<div class="proletariat">
<h4>sickle</h4>
<p>Ignore this text</p>
</div>
<div class="proletariat">
<h4>hammer</h4>
<p>This is the text we want</p>
</div>
'''

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")

If I write print(soup.select('div[class^="proletariat"] > h4 ~ p')), I get:

[<p>Ignore this text</p>, <p>This is the text we want</p>]

How do I specify that I only want the text of the p when it is preceded by <h4>hammer</h4>?

Thanks

Answer 1

html = '''
<div class="proletariat">
<h4>sickle</h4>
<p>Ignore this text</p>
</div>
<div class="proletariat">
<h4>hammer</h4>
<p>This is the text we want</p>
</div>
'''
import re
from bs4 import BeautifulSoup
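soup = BeautifulSoup(html, "html.parser")

# The h4's next sibling is the newline text node; the element after it is the <p>.
print(soup.find("h4", string=re.compile('hammer')).next_sibling.next_element.text)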
This is the text we want
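
The chain above works because, in this markup, the next sibling of the h4 is the newline between the tags, and the element that follows that newline is the p. If the whitespace changes, the chain can land somewhere else. A minimal sketch of a variant that does not depend on the whitespace, assuming the same html string as in the question and the built-in html.parser; the helper name text_after_heading is hypothetical:

```python
import re
from bs4 import BeautifulSoup

html = '''
<div class="proletariat">
<h4>sickle</h4>
<p>Ignore this text</p>
</div>
<div class="proletariat">
<h4>hammer</h4>
<p>This is the text we want</p>
</div>
'''


def text_after_heading(markup, pattern):
    """Return the text of the first <p> sibling after the <h4> whose
    string matches `pattern`, or None if there is no such pair."""
    soup = BeautifulSoup(markup, "html.parser")
    h4 = soup.find("h4", string=re.compile(pattern))
    if h4 is None:
        return None
    # find_next_sibling skips text nodes and looks only at <p> tags
    p = h4.find_next_sibling("p")
    return p.get_text() if p is not None else None


print(text_after_heading(html, "hammer"))
```

Returning None instead of raising AttributeError makes the "no match" case explicit for the caller.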

Answer 2

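:contains() would help here, but select() did not support it when this answer was written. Beautiful Soup releases from 4.7 onward, which use the Soup Sieve package for selectors, do support it, spelled :-soup-contains() in current versions.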

Taking this into account, you can use select() in conjunction with find_next_sibling():

print(next(h4.find_next_sibling('p').text
           for h4 in soup.select('div[class^="proletariat"] > h4')
           if h4.text == "hammer"))
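
Note that next() without a default raises StopIteration when no heading matches, and it stops at the first match. Since the page contains several repetitions of the block, here is a rough sketch that collects every matching paragraph instead. It assumes the html string from the question and the built-in html.parser; the function name paragraphs_after is made up for illustration:

```python
from bs4 import BeautifulSoup

html = '''
<div class="proletariat">
<h4>sickle</h4>
<p>Ignore this text</p>
</div>
<div class="proletariat">
<h4>hammer</h4>
<p>This is the text we want</p>
</div>
'''


def paragraphs_after(soup, heading_text):
    """Collect the text of the first <p> sibling following every <h4>
    whose stripped text equals `heading_text`."""
    results = []
    for h4 in soup.select('div[class^="proletariat"] > h4'):
        if h4.get_text(strip=True) != heading_text:
            continue
        p = h4.find_next_sibling("p")
        if p is not None:
            results.append(p.get_text())
    return results


soup = BeautifulSoup(html, "html.parser")
print(paragraphs_after(soup, "hammer"))
```

An empty list then means "no such heading", with no exception to handle.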

2 answers

solution1
1 ACCEPTED 2014-11-26 23:52:59

solution2
1 2014-11-26 23:57:35