Python BeautifulSoup: extract the text immediately after a specific tag

Question

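I am trying to extract information from a web page using BeautifulSoup and Python. I want to extract the information that sits right below a specific tag. To find the right tag, I want to compare its text, and then extract the text from the tag that immediately follows it.
For example, suppose the following is part of the HTML page source: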

<div class="row">
    ::before
    <div class="four columns">
        <p class="title">Procurement type</p>
        <p class="data strong">Services</p>
    </div>
  <div class="four columns">
      <p class="title">Reference</p>
      <p class="data strong">ANAJSKJD23423-Commission</p>
  </div>
  <div class="four columns">
      <p class="title">Funding Agency</p>
      <p class="data strong">Health Commission</p>
  </div>
  ::after
</div>
<div class="row">
    ::before
    ::after
</div>
<hr>
<div class="row">
    ::before
    <div class="twelve columns">
        <p class="title">Countries</p>
        <p class="data strong">
            <span class>Belgium</span>
            ", "
            <span class>France</span>
            ", "
            <span class>Luxembourg</span>
        </p>
        <p></p>
    </div>
    ::after
</div>

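I want to check whether a tag has the text Procurement type, and if so print Services.
Similarly, if a tag has the text Reference, I want to print ANAJSKJD23423-Commission, and if the text is Countries, I want to print all the countries, i.e. Belgium, France, Luxembourg.

I know I could extract all the text, append it to a list, and then fetch the values by index. The problem is that the order in which these fields appear is not fixed: in some places the countries may be listed before the procurement type. So I want to check the text value and then extract the text of the next immediate tag. I am new to BeautifulSoup, so any help is appreciated.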

Answer 1

There are several ways to do this. Here is one:

from bs4 import BeautifulSoup
htmldata='''<div class="row">
    ::before
    <div class="four columns">
        <p class="title">Procurement type</p>
        <p class="data strong">Services</p>
    </div>
  <div class="four columns">
      <p class="title">Reference</p>
      <p class="data strong">ANAJSKJD23423-Commission</p>
  </div>
  <div class="four columns">
      <p class="title">Funding Agency</p>
      <p class="data strong">Health Commission</p>
  </div>
  ::after
</div>
<div class="row">
    ::before
    ::after
</div>
<hr>
<div class="row">
    ::before
    <div class="twelve columns">
        <p class="title">Countries</p>
        <p class="data strong">
            <span class>Belgium</span>
            ", "
            <span class>France</span>
            ", "
            <span class>Luxembourg</span>
        </p>
        <p></p>
    </div>
    ::after
</div>'''

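# Parse with the parser bundled with the standard library
soup = BeautifulSoup(htmldata, 'html.parser')

# Every label sits in a <p class="title">; its value is in the next <p>
items = soup.find_all('p', class_='title')
for item in items:
    if ('Procurement type' in item.text) or ('Reference' in item.text):
        # find_next is the current name for the older findNext alias
        print(item.find_next('p').text)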
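Because the order of the fields is not fixed, it can help to collect every label/value pair into a dictionary and look values up by label. The following is a minimal sketch of that idea, not part of the original answer: the `extract_fields` helper and the shortened `SAMPLE_HTML` are hypothetical, and the sample assumes the countries are separated by plain commas in the real page source.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened sample modelled on the markup in the question.
SAMPLE_HTML = """
<div class="row">
  <div class="four columns">
    <p class="title">Procurement type</p>
    <p class="data strong">Services</p>
  </div>
  <div class="four columns">
    <p class="title">Reference</p>
    <p class="data strong">ANAJSKJD23423-Commission</p>
  </div>
</div>
<div class="row">
  <div class="twelve columns">
    <p class="title">Countries</p>
    <p class="data strong">
      <span>Belgium</span>, <span>France</span>, <span>Luxembourg</span>
    </p>
  </div>
</div>
"""


def extract_fields(html):
    """Map each <p class="title"> label to the text of the <p> that follows it."""
    soup = BeautifulSoup(html, "html.parser")
    fields = {}
    for title in soup.find_all("p", class_="title"):
        # find_next_sibling skips whitespace text nodes and returns the next <p> tag
        value = title.find_next_sibling("p")
        if value is None:
            continue
        # Collapse newlines and indentation inside the value into single spaces
        fields[title.get_text(strip=True)] = " ".join(value.get_text().split())
    return fields


fields = extract_fields(SAMPLE_HTML)
print(fields["Procurement type"])
print(fields["Reference"])
print(fields["Countries"])
```

Looking values up by label means the code keeps working when the page lists Countries before Procurement type.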

Answer 2

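You can also use the :contains pseudo-class with bs4 4.7.1. I have passed all the conditions as one selector list, but you could run each condition separately. Note that newer Soup Sieve releases (2.1 onward) spell this pseudo-class :-soup-contains and deprecate the old :contains name.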

from bs4 import BeautifulSoup as bs
import re

html = 'yourHTML'   
soup = bs(html, 'lxml')
items=[re.sub(r'\n\s+','', item.text.strip()) for item in soup.select('p.title:contains("Procurement type") + p, p.title:contains(Reference) + p, p.title:contains(Countries) + p')]
print(items)

Output:
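The selector approach can also be tried end to end with the sketch below. It is not the answerer's code: it assumes a Soup Sieve version that supports `:-soup-contains` (2.1 or later), uses the built-in `html.parser` instead of `lxml`, and runs against a shortened, hypothetical `SAMPLE_HTML`.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened sample modelled on the markup in the question.
SAMPLE_HTML = """
<div class="row">
  <div class="four columns">
    <p class="title">Procurement type</p>
    <p class="data strong">Services</p>
  </div>
  <div class="four columns">
    <p class="title">Reference</p>
    <p class="data strong">ANAJSKJD23423-Commission</p>
  </div>
  <div class="four columns">
    <p class="title">Funding Agency</p>
    <p class="data strong">Health Commission</p>
  </div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Build one selector list: for each wanted label, match the <p> that
# directly follows a p.title containing that label (the "+" combinator).
wanted = ("Procurement type", "Reference")
selector = ", ".join(f'p.title:-soup-contains("{label}") + p' for label in wanted)

# select() returns the matching tags in document order
values = [p.get_text(strip=True) for p in soup.select(selector)]
print(values)
```

The `+` adjacent-sibling combinator is what ties each value to its label, so unrelated fields such as Funding Agency are skipped.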

Answer 3

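When you use .find() or .find_all() and then .next_sibling or findNext() to get the next tag that holds the content, you can pass an extra argument to match on specific text.

For example: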

soup.find('p', {'class':'title'}, string = 'Procurement type')  # string= is the current name for the older text= argument

Given:

html = '''<div class="row">
    ::before
    <div class="four columns">
        <p class="title">Procurement type</p>
        <p class="data strong">Services</p>
    </div>
  <div class="four columns">
      <p class="title">Reference</p>
      <p class="data strong">ANAJSKJD23423-Commission</p>
  </div>
  <div class="four columns">
      <p class="title">Funding Agency</p>
      <p class="data strong">Health Commission</p>
  </div>
  ::after
</div>
<div class="row">
    ::before
    ::after
</div>
<hr>
<div class="row">
    ::before
    <div class="twelve columns">
        <p class="title">Countries</p>
        <p class="data strong">
            <span class>Belgium</span>
            ", "
            <span class>France</span>
            ", "
            <span class>Luxembourg</span>
        </p>
        <p></p>
    </div>
    ::after
</div>'''

you can do the following:

from bs4 import BeautifulSoup     

soup = BeautifulSoup(html, 'html.parser')

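# string= is the current name for the older text= argument
alpha = soup.find('p', {'class':'title'}, string = 'Procurement type')
# find_next_siblings() returns only tags, so whitespace text nodes are skipped
# without needing a bare try/except
for sibling in alpha.find_next_siblings():
    print (sibling.text)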

Output:

Services

Or

ref = soup.find('p', {'class':'title'}, string = 'Reference')
# Only tag siblings are returned, so no try/except is needed
for sibling in ref.find_next_siblings():
    print (sibling.text)

Output:

ANAJSKJD23423-Commission

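Or

countries = soup.find('p', {'class':'title'}, string = 'Countries')
# Drop the literal '", "' separators, then split the remaining text line by line
names = countries.find_next('p', {'class':'data strong'}).text.replace('", "','').strip().split('\n')
# Trim each line and discard the ones that are empty or whitespace only
names = [name.strip() for name in names if name.strip()]

for country in names:
    print (country)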

Output:

Belgium
France
Luxembourg
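Splitting the paragraph text on newlines depends on how the source happens to be indented. As a hedged alternative sketch, not part of the original answer, the country names can be read from the span elements directly; `SAMPLE_HTML` below is a shortened, hypothetical sample.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened sample modelled on the Countries block in the question.
SAMPLE_HTML = """
<div class="twelve columns">
  <p class="title">Countries</p>
  <p class="data strong">
    <span>Belgium</span>, <span>France</span>, <span>Luxembourg</span>
  </p>
  <p></p>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Locate the label by its exact text, then step to the <p> holding the values
label = soup.find("p", class_="title", string="Countries")
data = label.find_next_sibling("p", class_="data") if label else None

# Each country lives in its own <span>, so separators and indentation are ignored
countries = [span.get_text(strip=True) for span in data.find_all("span")] if data else []

for country in countries:
    print(country)
```

Reading the spans avoids string surgery on separators, at the cost of assuming every value is wrapped in its own span.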

Solution 1
4 Accepted 2019-04-10 11:48:44

Solution 2
2 2019-04-10 12:24:17

Solution 3
1 2019-04-10 11:46:51
