BeautifulSoup not extracting specific tag text

I'm having a problem extracting the information for a specific tag using BeautifulSoup. I would like to extract the text inside the <html> tag that belongs to 'Item 4', but the code below gets the text related to 'Item 1'. What am I doing incorrectly (e.g., slicing)?

Code:

primary_detail = page_section.findAll('div', {'class': 'detail-item'})
for item_4 in page_section.find('h3', string='Item 4'):
  if item_4:
    for item_4_content in page_section.find('html'):
      print (item_4_content)

HTML:

<div class="detail-item">
   <h3>Item 1</h3>
   <html><body><p>Item 1 text here</p></body></html>
</div>

<div class="detail-item">
   <h3>Item 2</h3>
   <html><body><p>Item 2 text here</p></body></html>
</div>

<div class="detail-item">
   <h3>Item 3</h3>
   <html><body><p>Item 3 text here</p></body></html>
</div>

<div class="detail-item">
   <h3>Item 4</h3>
   <html><body><p>Item 4 text here</p></body></html>
</div>

It looks like you want to print the <p> tag content according to the <h3> text value, correct?

Your code must:

  1. load an html_source
  2. search for all 'div' tags whose 'class' is equal to 'detail-item'
  3. for each occurrence, check whether the .text value of the <h3> tag contains the string 'Item 4'
  4. if it does, print the .text value of the corresponding <p> tag

You can achieve what you want by using the following code.

Code:

s = '''<div class="detail-item">
   <h3>Item 1</h3>
   <html><body><p>Item 1 text here</p></body></html>
</div>

<div class="detail-item">
   <h3>Item 2</h3>
   <html><body><p>Item 2 text here</p></body></html>
</div>

<div class="detail-item">
   <h3>Item 3</h3>
   <html><body><p>Item 3 text here</p></body></html>
</div>

<div class="detail-item">
   <h3>Item 4</h3>
   <html><body><p>Item 4 text here</p></body></html>
</div>'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(s, 'lxml')

primary_detail = soup.find_all('div', {'class': 'detail-item'})

for tag in primary_detail:
    if 'Item 4' in tag.h3.text:
        print(tag.p.text)

Output:

Item 4 text here
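As for why the original attempt returns the 'Item 1' text: page_section.find('html') always searches from the top of page_section, so it returns the first <html> tag no matter which <h3> was matched, and looping over a tag with for only walks its children. If you would rather keep the "find the heading first" approach, a minimal sketch could look like the following. The helper name text_under_heading is made up for illustration, the HTML is a shortened copy of the sample above, and 'html.parser' is used here only so the snippet runs without lxml installed.

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample markup from the question
html = '''
<div class="detail-item">
   <h3>Item 1</h3>
   <html><body><p>Item 1 text here</p></body></html>
</div>
<div class="detail-item">
   <h3>Item 4</h3>
   <html><body><p>Item 4 text here</p></body></html>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')


def text_under_heading(soup, heading_text):
    """Hypothetical helper: return the <p> text that shares a
    'detail-item' div with the <h3> whose text is heading_text."""
    heading = soup.find('h3', string=heading_text)
    if heading is None:
        return None
    # Climb to the enclosing block, then search only inside that block
    container = heading.find_parent('div', class_='detail-item')
    if container is None:
        return None
    paragraph = container.find('p')
    return paragraph.get_text(strip=True) if paragraph is not None else None


print(text_under_heading(soup, 'Item 4'))
```

The point is that the second search starts from the matched heading's own container rather than from the whole page section, so it cannot pick up the first block by accident.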

EDIT: On the provided website, the first element in the loop doesn't have an <h3> tag, only an <h2>, so tag.h3 is None and has no .text value, correct?

You can bypass this error using a try/except clause, as in the following code.

Code:

from bs4 import BeautifulSoup
import requests


url = 'https://fortiguard.com/psirt/FG-IR-17-097'
html_source = requests.get(url).text

soup = BeautifulSoup(html_source, 'lxml')

primary_detail = soup.find_all('div', {'class': 'detail-item'})

for tag in primary_detail:
    try:
        if 'Solutions' in tag.h3.text:
            print(tag.p.text)
    except AttributeError:
        # tag.h3 (or tag.p) is None when the block lacks that tag
        continue

If the code hits an exception, it continues the iteration with the next element in the loop. So the code ignores the first item, which doesn't have an <h3> tag and therefore no .text value.
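If you prefer not to rely on exceptions, the same skip can be written as an explicit None check. This is only a sketch: the HTML below is an invented stand-in for the real page (so it runs offline), and the function name paragraphs_under is hypothetical.

```python
from bs4 import BeautifulSoup

# Invented stand-in for the real page: the first block has an <h2> but no <h3>
html = '''
<div class="detail-item"><h2>Summary</h2><p>No h3 in this block</p></div>
<div class="detail-item"><h3>Solutions</h3><p>Upgrade text here</p></div>
'''

soup = BeautifulSoup(html, 'html.parser')


def paragraphs_under(soup, heading_text):
    """Hypothetical helper: collect the <p> text of every 'detail-item'
    block whose <h3> contains heading_text."""
    results = []
    for tag in soup.find_all('div', class_='detail-item'):
        heading = tag.h3  # None when the block has no <h3>
        if heading is None or tag.p is None:
            continue  # skip blocks missing either tag
        if heading_text in heading.get_text():
            results.append(tag.p.get_text(strip=True))
    return results


print(paragraphs_under(soup, 'Solutions'))
```

Compared with try/except, the check makes it obvious which condition is being skipped and does not hide unrelated errors.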

Output:

Upgrade to FortiWLC-SD version 8.3.0
