Python BeautifulSoup extracting the text right after a particular tag
I am trying to extract information from a web page using BeautifulSoup and Python. I want to extract the information that appears immediately below a particular tag. To identify the correct tag, I want to compare its text, and then extract the text of the tag that immediately follows it.
For example, suppose the following is part of the HTML page source:
<div class="row">
::before
<div class="four columns">
<p class="title">Procurement type</p>
<p class="data strong">Services</p>
</div>
<div class="four columns">
<p class="title">Reference</p>
<p class="data strong">ANAJSKJD23423-Commission</p>
</div>
<div class="four columns">
<p class="title">Funding Agency</p>
<p class="data strong">Health Commission</p>
</div>
::after
</div>
<div class="row">
::before
::after
</div>
<hr>
<div class="row">
::before
<div class="twelve columns">
<p class="title">Countries</p>
<p class="data strong">
<span class>Belgium</span>
", "
<span class>France</span>
", "
<span class>Luxembourg</span>
</p>
<p></p>
</div>
::after
</div>
I want to check whether <p class="title"> has the text value Procurement type, and if so print out Services. Similarly, if <p class="title"> has the text value Reference, I want to print out ANAJSKJD23423-Commission, and if the value of <p class="title"> is Countries, print out all the countries, i.e. Belgium, France, Luxembourg.
I know I could extract all the text from the <p class="data strong"> tags, append it to a list, and then use indexes to get the values. The problem is that the order in which these <p class="title"> tags appear is not fixed: on some pages Countries may be mentioned before Procurement type. That is why I want to check the text value first and then extract the text value of the immediately following tag. I am new to BeautifulSoup, so any help is appreciated.
There are several ways you can do this.
from bs4 import BeautifulSoup
htmldata='''<div class="row">
::before
<div class="four columns">
<p class="title">Procurement type</p>
<p class="data strong">Services</p>
</div>
<div class="four columns">
<p class="title">Reference</p>
<p class="data strong">ANAJSKJD23423-Commission</p>
</div>
<div class="four columns">
<p class="title">Funding Agency</p>
<p class="data strong">Health Commission</p>
</div>
::after
</div>
<div class="row">
::before
::after
</div>
<hr>
<div class="row">
::before
<div class="twelve columns">
<p class="title">Countries</p>
<p class="data strong">
<span class>Belgium</span>
", "
<span class>France</span>
", "
<span class>Luxembourg</span>
</p>
<p></p>
</div>
::after
</div>'''
soup=BeautifulSoup(htmldata,'html.parser')
items=soup.find_all('p', class_='title')
for item in items:
    if ('Procurement type' in item.text) or ('Reference' in item.text):
        print(item.findNext('p').text)
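Since the question's real concern is that the blocks can appear in any order, a related sketch (not part of the answer above, and the trimmed htmldata fragment below is only illustrative) is to map every title to the text of the <p> that follows it, so lookups no longer depend on position:

```python
from bs4 import BeautifulSoup

# Trimmed fragment of the page structure described in the question.
htmldata = '''<div class="four columns">
<p class="title">Procurement type</p>
<p class="data strong">Services</p>
</div>
<div class="twelve columns">
<p class="title">Countries</p>
<p class="data strong">
<span>Belgium</span>
", "
<span>France</span>
", "
<span>Luxembourg</span>
</p>
</div>'''

soup = BeautifulSoup(htmldata, 'html.parser')

# Map each title to the content of the <p class="data ..."> right after it,
# so the lookup works regardless of the order the blocks appear in.
data = {}
for title in soup.find_all('p', class_='title'):
    value = title.find_next('p', class_='data')
    if value is not None:
        spans = value.find_all('span')
        if spans:
            # Multi-value fields like Countries keep one entry per <span>.
            data[title.get_text(strip=True)] = [s.get_text(strip=True) for s in spans]
        else:
            data[title.get_text(strip=True)] = value.get_text(strip=True)

print(data['Procurement type'])  # Services
print(data['Countries'])         # ['Belgium', 'France', 'Luxembourg']
```

Because class_='data' matches any tag whose class list contains "data", it also matches class="data strong".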
With bs4 4.7.1+ you can also use the :contains pseudo-class. Although I have combined the conditions into one selector list here, you could separate each condition:
from bs4 import BeautifulSoup as bs
import re
html = 'yourHTML'
soup = bs(html, 'lxml')
items = [re.sub(r'\n\s+', '', item.text.strip()) for item in soup.select(
    'p.title:contains("Procurement type") + p, '
    'p.title:contains(Reference) + p, '
    'p.title:contains(Countries) + p')]
print(items)
Output:
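To separate each condition as suggested above, a sketch using one select_one call per label might look like this (assuming bs4 4.7.1+ with Soup Sieve for :contains; the html fragment here is trimmed from the question for illustration):

```python
from bs4 import BeautifulSoup as bs

# Trimmed fragment of the question's HTML, used only for illustration.
html = '''<div><p class="title">Procurement type</p>
<p class="data strong">Services</p></div>
<div><p class="title">Reference</p>
<p class="data strong">ANAJSKJD23423-Commission</p></div>'''

soup = bs(html, 'html.parser')

# One selector per condition: the adjacent-sibling combinator (+)
# selects the <p> immediately after the matching title.
procurement = soup.select_one('p.title:contains("Procurement type") + p')
reference = soup.select_one('p.title:contains("Reference") + p')

print(procurement.text)  # Services
print(reference.text)    # ANAJSKJD23423-Commission
```

select_one returns None when no title matches, so a real page scraper would check for that before reading .text.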
When you use .find() or .find_all() and then .next_sibling or findNext() to get the next tag containing the content, you can add an argument to check for specific text, i.e.:
soup.find('p', {'class':'title'}, text = 'Procurement type')
Given:
html = '''<div class="row">
::before
<div class="four columns">
<p class="title">Procurement type</p>
<p class="data strong">Services</p>
</div>
<div class="four columns">
<p class="title">Reference</p>
<p class="data strong">ANAJSKJD23423-Commission</p>
</div>
<div class="four columns">
<p class="title">Funding Agency</p>
<p class="data strong">Health Commission</p>
</div>
::after
</div>
<div class="row">
::before
::after
</div>
<hr>
<div class="row">
::before
<div class="twelve columns">
<p class="title">Countries</p>
<p class="data strong">
<span class>Belgium</span>
", "
<span class>France</span>
", "
<span class>Luxembourg</span>
</p>
<p></p>
</div>
::after
</div>'''
You can do the following:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
alpha = soup.find('p', {'class':'title'}, text = 'Procurement type')
for sibling in alpha.next_siblings:
    try:
        print (sibling.text)
    except:
        continue
Output:
Services
Or:
ref = soup.find('p', {'class':'title'}, text = 'Reference')
for sibling in ref.next_siblings:
    try:
        print (sibling.text)
    except:
        continue
Output:
ANAJSKJD23423-Commission
Or:
countries = soup.find('p', {'class':'title'}, text = 'Countries')
names = countries.findNext('p', {'class':'data strong'}).text.replace('", "','').strip().split('\n')
names = [name.strip() for name in names if not name.isspace()]
for country in names:
    print (country)
Output:
Belgium
France
Luxembourg
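If the quoted `", "` separators inside the Countries paragraph ever vary, an arguably more robust variation (my own sketch, not part of the answer above; the html fragment is trimmed from the question) is to read each <span> directly instead of doing string surgery on the combined text:

```python
from bs4 import BeautifulSoup

# Fragment of the Countries block from the question, for illustration.
html = '''<p class="title">Countries</p>
<p class="data strong">
<span>Belgium</span>
", "
<span>France</span>
", "
<span>Luxembourg</span>
</p>'''

soup = BeautifulSoup(html, 'html.parser')
countries = soup.find('p', {'class': 'title'}, text='Countries')

# Each country sits in its own <span>, so iterating over the spans
# sidesteps the separator text entirely.
names = [span.get_text(strip=True)
         for span in countries.find_next('p', class_='data').find_all('span')]
print(names)  # ['Belgium', 'France', 'Luxembourg']
```

Newer bs4 releases prefer string= over text= in find(), but both filters behave the same here.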