BeautifulSoup: extract by class value text

I want to extract paragraph data based on the text of each h2 (all of them share the class h3). Below is the HTML code.

<div class="myClass">
<div itemprop="reviewBody" class="review-body">
<h2 class="h3">Test1</h2><p>I want to extract this</p>
<h2 class="h3">Test2</h2><p>Dont want to extract</p>
<h2 class="h3">Test3</h2><p>I want to extract this too</p>
</div>
</div>

The output should look like this:

Test1    | I want to extract this
Test3    | I want to extract this too

Below is my code, but it extracts all of the headings (Test1, Test2, Test3). How can I extract data based on the h2 text?

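from bs4 import BeautifulSoup as bs

# page is the HTTP response fetched earlier (presumably with requests)
soup = bs(page.text, 'html.parser')
divs = soup.find_all(class_="myClass")

test1 = []

# Collects the text of every h2 with class "h3", with no filtering on the text
for item in divs[0].find_all('h2', class_="h3"):
    test1.append(item.text.strip())
print(test1)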

If I understand correctly, you'd like to apply an additional condition on the h2 text. You can use the text argument of .find_all(), which can take a list of strings to match (newer Beautiful Soup releases call this argument string), e.g.:

for h2 in soup.find_all('h2', class_='h3', text=['Test1', 'Test3']):
    print(h2.get_text())

If you also want the paragraph that follows each matched heading, you can use find_next_sibling():

for h2 in soup.find_all('h2', class_='h3', text=['Test1', 'Test3']):
    print(h2.find_next_sibling('p').get_text())
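Putting the two steps together, here is a minimal self-contained sketch that produces the "heading | paragraph" rows asked for in the question. It assumes the HTML from the question is available as a string, uses the newer string argument in place of text, and the names html, wanted and rows are only illustrative:

```python
from bs4 import BeautifulSoup

# The HTML from the question, inlined so the example runs on its own
html = """
<div class="myClass">
<div itemprop="reviewBody" class="review-body">
<h2 class="h3">Test1</h2><p>I want to extract this</p>
<h2 class="h3">Test2</h2><p>Dont want to extract</p>
<h2 class="h3">Test3</h2><p>I want to extract this too</p>
</div>
</div>
"""

# Headings whose following paragraph should be kept (illustrative name)
wanted = ["Test1", "Test3"]

soup = BeautifulSoup(html, "html.parser")

rows = []
# string= accepts a list and matches an h2 whose text equals any entry
for h2 in soup.find_all("h2", class_="h3", string=wanted):
    p = h2.find_next_sibling("p")
    if p is not None:  # skip headings that have no following <p> sibling
        rows.append((h2.get_text(strip=True), p.get_text(strip=True)))

for heading, paragraph in rows:
    print(f"{heading} | {paragraph}")
```

Note that find_next_sibling('p') returns the first later p sibling, not necessarily the element right after the heading, so this relies on every h2 being followed by its own paragraph, as in the question's markup.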
