BeautifulSoup not extracting specific tag text

I'm having trouble extracting the text for a specific tag with BeautifulSoup. I would like to extract the text for 'Item 4' from inside its html tag, but the code below returns the text for 'Item 1'. What am I doing wrong (e.g., slicing)?

Code:

primary_detail = page_section.findAll('div', {'class': 'detail-item'})
for item_4 in page_section.find('h3', string='Item 4'):
  if item_4:
    for item_4_content in page_section.find('html'):
      print (item_4_content)

HTML:

<div class="detail-item">
   <h3>Item 1</h3>
   <html><body><p>Item 1 text here</p></body></html>
</div>

<div class="detail-item">
   <h3>Item 2</h3>
   <html><body><p>Item 2 text here</p></body></html>
</div>

<div class="detail-item">
   <h3>Item 3</h3>
   <html><body><p>Item 3 text here</p></body></html>
</div>

<div class="detail-item">
   <h3>Item 4</h3>
   <html><body><p>Item 4 text here</p></body></html>
</div>

It looks like you want to print the <p> tag content according to <h3> text value, correct?
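
As for why the original snippet returns 'Item 1': find() searches the whole subtree of the object it is called on and returns the first match, so page_section.find('html') does not depend on the <h3> that was found just before. A minimal sketch of that behaviour follows. The shortened two-item sample, the use of <p> directly inside each div, and the built-in 'html.parser' are assumptions made for illustration, not the asker's actual page or parser.

```python
from bs4 import BeautifulSoup

# Shortened sample for illustration only (not the asker's real page).
sample = '''
<div class="detail-item"><h3>Item 1</h3><p>Item 1 text here</p></div>
<div class="detail-item"><h3>Item 4</h3><p>Item 4 text here</p></div>
'''
page_section = BeautifulSoup(sample, 'html.parser')

# Locating the heading does not narrow later searches on page_section.
heading = page_section.find('h3', string='Item 4')

# This search still starts at the top of page_section,
# so it returns the first <p> in the document.
unscoped = page_section.find('p')

# Searching from the heading's parent <div> limits the search to that item.
scoped = heading.parent.find('p')

print(unscoped.get_text())  # Item 1 text here
print(scoped.get_text())    # Item 4 text here
```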

Your code must:

  1. load the HTML source
  2. search for all 'div' tags whose 'class' is equal to 'detail-item'
  3. for each occurrence, check whether the .text value of its <h3> tag contains the string 'Item 4'
  4. if it does, print the .text value of the corresponding <p> tag

You can achieve what you want by using the following code.

Code:

s = '''<div class="detail-item">
   <h3>Item 1</h3>
   <html><body><p>Item 1 text here</p></body></html>
</div>

<div class="detail-item">
   <h3>Item 2</h3>
   <html><body><p>Item 2 text here</p></body></html>
</div>

<div class="detail-item">
   <h3>Item 3</h3>
   <html><body><p>Item 3 text here</p></body></html>
</div>

<div class="detail-item">
   <h3>Item 4</h3>
   <html><body><p>Item 4 text here</p></body></html>
</div>'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(s, 'lxml')

primary_detail = soup.find_all('div', {'class': 'detail-item'})

for tag in primary_detail:
    if 'Item 4' in tag.h3.text:
        print(tag.p.text)

Output:

Item 4 text here
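
Note that the `in` test is a substring match, so a heading such as 'Item 40' would also match 'Item 4'. If that matters, an exact comparison may be safer. The sketch below is one possible way to do it: text_for_heading is a hypothetical helper name, the three-item sample is made up for illustration, and it assumes every 'detail-item' div has an <h3>.

```python
from bs4 import BeautifulSoup


def text_for_heading(soup, heading):
    """Return the <p> text of the detail-item whose <h3> equals heading.

    Hypothetical helper; returns None when no heading matches.
    Assumes every detail-item div contains an <h3> and a <p>.
    """
    for div in soup.find_all('div', class_='detail-item'):
        # Exact comparison after stripping surrounding whitespace.
        if div.h3.get_text(strip=True) == heading:
            return div.p.get_text(strip=True)
    return None


# Made-up sample with a heading ('Item 40') that contains 'Item 4'.
sample = ''.join(
    f'<div class="detail-item"><h3>Item {n}</h3><p>Item {n} text here</p></div>'
    for n in (1, 40, 4)
)
soup = BeautifulSoup(sample, 'html.parser')

result = text_for_heading(soup, 'Item 4')
missing = text_for_heading(soup, 'Item 9')
print(result)   # Item 4 text here
print(missing)  # None
```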

EDIT: On the website you provided, the first 'div' in the loop doesn't have an <h3> tag, only an <h2>, so tag.h3 is None and accessing .text on it raises an AttributeError, correct?

You can work around this error with a try/except clause, as in the following code.

Code:

from bs4 import BeautifulSoup
import requests


url = 'https://fortiguard.com/psirt/FG-IR-17-097'
html_source = requests.get(url).text

soup = BeautifulSoup(html_source, 'lxml')

primary_detail = soup.find_all('div', {'class': 'detail-item'})

for tag in primary_detail:
    try:
        if 'Solutions' in tag.h3.text:
            print(tag.p.text)
    except AttributeError:
        # tag.h3 (or tag.p) is None when the div lacks that tag; skip it
        continue

If the code raises an exception, it continues the iteration with the next element in the loop, so it skips the first item, which has no <h3> tag.
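
An alternative to catching the exception is to test for the missing <h3> explicitly, which avoids hiding unrelated errors. The sketch below uses a hypothetical inline sample that only mimics the layout described above (a first block with an <h2> and no <h3>); it is not the real page's HTML, and 'html.parser' is used so that no extra parser is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the first block has an <h2> but no <h3>.
sample = '''
<div class="detail-item"><h2>Summary</h2><p>Summary text</p></div>
<div class="detail-item"><h3>Solutions</h3><p>Upgrade text</p></div>
'''
soup = BeautifulSoup(sample, 'html.parser')

matches = []
for tag in soup.find_all('div', class_='detail-item'):
    heading = tag.find('h3')
    if heading is None:
        # No <h3> in this block, so there is nothing to compare against.
        continue
    if 'Solutions' in heading.get_text():
        matches.append(tag.p.get_text())

print(matches)  # ['Upgrade text']
```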

Output:

Upgrade to FortiWLC-SD version 8.3.0
