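Extract text blocks between <p> tags separated by <br>
I want to parse all the text blocks (TITLE CONTENT, BODY CONTENT and EXTRA CONTENT) in the following example. As you may notice, these text blocks are positioned differently within each "p" tag.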
<p class="plans">
<strong>
TITLE CONTENT #1
</strong>
<br/>
BODY CONTENT #1
<br/>
EXTRA CONTENT #1
</p>
<p class="plans">
<strong>
TITLE CONTENT #2
<br/>
</strong>
BODY CONTENT #2
<br/>
EXTRA CONTENT #2
</p>
<p class="plans">
<strong>
TITLE CONTENT #3
</strong>
<br/>
BODY CONTENT #3
<br/>
EXTRA CONTENT #3
</p>
I want to display the final result in a table format:
Col1              Col2             Col3
TITLE CONTENT #1  BODY CONTENT #1  EXTRA CONTENT #1
TITLE CONTENT #2  BODY CONTENT #2  EXTRA CONTENT #2
TITLE CONTENT #3  BODY CONTENT #3  EXTRA CONTENT #3
I tried:
for i in soup.find_all('p'):
    title = i.find('strong')
    if not isinstance(title.nextSibling, NavigableString):
        body = title.nextSibling.nextSibling
        extra = body.nextSibling.nextSibling
    else:
        if len(title.nextSibling) > 3:
            body = title.nextSibling
            extra = body.nextSibling.nextSibling
        else:
            body = title.nextSibling.nextSibling.nextSibling
            extra = body.nextSibling.nextSibling
But this does not look efficient. I wonder if anyone has a better solution?
Any help would be appreciated!
Thanks!
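It is important to note that .next_sibling can work as well; you will just need some logic to know how many times to call it, since you may need to collect multiple text nodes. In this example, I find it easier to simply walk the descendants, noting the important characteristics that signal me to do something different.
You just have to break down the characteristics of what you are scraping. In this simple case, we know:
- When we see the strong element, we want to capture the "title".
- When we see the first br element, we want to start capturing the "content".
- When we see the second br element, we want to start capturing the "extra content".
We can:
- Select the plans class to get all the plans.
- Iterate through all the descendant nodes of each plan.
from bs4 import BeautifulSoup as bs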
from bs4 import Tag, NavigableString
html = """
<p class="plans">
<strong>
TITLE CONTENT #1
</strong>
<br/>
BODY CONTENT #1
<br/>
EXTRA CONTENT #1
</p>
<p class="plans">
<strong>
TITLE CONTENT #2
<br/>
</strong>
BODY CONTENT #2
<br/>
EXTRA CONTENT #2
</p>
<p class="plans">
<strong>
TITLE CONTENT #3
</strong>
<br/>
BODY CONTENT #3
<br/>
EXTRA CONTENT #3
</p>
"""
soup = bs(html, 'html.parser')
content = []
# Iterate through all the plans
for plans in soup.select('.plans'):
    # Lists that will hold the text nodes of interest
    title = []
    body = []
    extra = []
    current = None  # Reference to one of the above lists to store data
    br = 0  # Count number of br tags
    # Iterate through all the descendant nodes of a plan
    for node in plans.descendants:
        # See if the node is a Tag/Element
        if isinstance(node, Tag):
            if node.name == 'strong':
                # Strong tags/elements contain our title
                # So set the current container for text to the title list
                current = title
            elif node.name == 'br':
                # We've found a br Tag/Element
                br += 1
                if br == 1:
                    # If this is the first, we need to set the current
                    # container for text to the body list
                    current = body
                elif br == 2:
                    # If this is the second, we need to set the current
                    # container for text to the extra list
                    current = extra
        elif isinstance(node, NavigableString) and current is not None:
            # We've found a navigable string (not a tag/element), so let's
            # store the text node in the current list container.
            # NOTE: You may have to filter out things like HTML comments in a real world example.
            current.append(node)
    # Store the captured title, body, and extra text for the current plan.
    # For each list, join the text into one string and strip leading and trailing whitespace
    # from each entry in the row.
    content.append([''.join(entry).strip() for entry in (title, body, extra)])
print(content)
You can then print the data however you like, but it should now be captured in a nice, logical structure, as shown below:
[['TITLE CONTENT #1', 'BODY CONTENT #1', 'EXTRA CONTENT #1'], ['TITLE CONTENT #2', 'BODY CONTENT #2', 'EXTRA CONTENT #2'], ['TITLE CONTENT #3', 'BODY CONTENT #3', 'EXTRA CONTENT #3']]
There are multiple ways to do this; this is just one of them.
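The NOTE in the code above mentions that HTML comments may need filtering. A minimal sketch of that idea follows, assuming the same descendant-walking logic; the function name extract_rows and the sample markup are hypothetical. In bs4, Comment is a subclass of NavigableString, so it has to be checked before the generic string branch.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag


def extract_rows(html):
    """Return [title, body, extra] for every <p class="plans"> in html."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for plan in soup.select("p.plans"):
        title, body, extra = [], [], []
        current = None  # list that currently receives text nodes
        br_count = 0
        for node in plan.descendants:
            if isinstance(node, Tag):
                if node.name == "strong":
                    current = title
                elif node.name == "br":
                    br_count += 1
                    if br_count == 1:
                        current = body
                    elif br_count == 2:
                        current = extra
            elif isinstance(node, Comment):
                # Comment subclasses NavigableString, so skip it first
                continue
            elif isinstance(node, NavigableString) and current is not None:
                current.append(node)
        rows.append(["".join(part).strip() for part in (title, body, extra)])
    return rows


# Hypothetical sample with a comment inside the title
sample = """
<p class="plans">
<strong>TITLE<!-- internal note --></strong>
<br/>
BODY
<br/>
EXTRA
</p>
"""
print(extract_rows(sample))
```

Without the Comment branch, the text "internal note" would be appended to the title.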
Another way is to use slicing, assuming the structure does not vary (every <p> always yields exactly three strings):
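from bs4 import BeautifulSoup

# Assumes the sample HTML is saved as test.html
with open("test.html") as f:
    soup = BeautifulSoup(f, "html.parser")

def slicing(l):
    # Split the flat list into consecutive chunks of three items
    new_list = []
    for i in range(0, len(l), 3):
        new_list.append(l[i:i+3])
    return new_list

# stripped_strings yields every non-empty, whitespace-stripped string in the document
result = slicing(list(soup.stripped_strings))
print(result)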
Output:
[['TITLE CONTENT #1', 'BODY CONTENT #1', 'EXTRA CONTENT #1'], ['TITLE CONTENT #2', 'BODY CONTENT #2', 'EXTRA CONTENT #2'], ['TITLE CONTENT #3', 'BODY CONTENT #3', 'EXTRA CONTENT #3']]
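Slicing the whole document in threes depends on there being no other text on the page. A possible variation, sketched here under the assumption that each row corresponds to one p.plans element, is to collect stripped_strings per paragraph instead. The function name rows_per_paragraph and the extra <h1> in the sample are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the <h1> adds a string outside the plans,
# which would shift every chunk if the whole document were sliced in threes.
html = """
<h1>Plans</h1>
<p class="plans"><strong>TITLE CONTENT #1</strong><br/>BODY CONTENT #1<br/>EXTRA CONTENT #1</p>
<p class="plans"><strong>TITLE CONTENT #2<br/></strong>BODY CONTENT #2<br/>EXTRA CONTENT #2</p>
"""


def rows_per_paragraph(markup):
    """Return one list of stripped strings per <p class="plans">."""
    soup = BeautifulSoup(markup, "html.parser")
    return [list(p.stripped_strings) for p in soup.select("p.plans")]


rows = rows_per_paragraph(html)
print(rows)
```

Each row then has as many entries as the paragraph has non-empty text nodes, so a paragraph with a missing block shows up as a short row rather than shifting the rows after it.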
In this case, you can use BeautifulSoup's get_text() method with the separator= parameter:
data = '''<p class="plans">
<strong>
TITLE CONTENT #1
</strong>
<br/>
BODY CONTENT #1
<br/>
EXTRA CONTENT #1
</p>
<p class="plans">
<strong>
TITLE CONTENT #2
<br/>
</strong>
BODY CONTENT #2
<br/>
EXTRA CONTENT #2
</p>
<p class="plans">
<strong>
TITLE CONTENT #3
</strong>
<br/>
BODY CONTENT #3
<br/>
EXTRA CONTENT #3
</p>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'lxml')
print('{: ^25}{: ^25}{: ^25}'.format('Col1', 'Col2', 'Col3'))
for p in [[i.strip() for i in p.get_text(separator='|').split('|') if i.strip()] for p in soup.select('p.plans')]:
    print(''.join('{: ^25}'.format(i) for i in p))
Prints:
          Col1                     Col2                     Col3
    TITLE CONTENT #1          BODY CONTENT #1         EXTRA CONTENT #1
    TITLE CONTENT #2          BODY CONTENT #2         EXTRA CONTENT #2
    TITLE CONTENT #3          BODY CONTENT #3         EXTRA CONTENT #3
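One caveat with splitting on '|' is that it breaks if the text itself contains that character. The sketch below is one way around that, not the answer's own code: it passes strip=True to get_text() so each string is stripped before joining, and uses an unlikely separator (the ASCII unit separator, an arbitrary choice). The helper names plan_rows and format_table are hypothetical, html.parser is swapped in to avoid the lxml dependency, and format_table assumes every row has as many cells as the header.

```python
from bs4 import BeautifulSoup


def plan_rows(markup, sep="\x1f"):
    """Return the stripped text blocks of each <p class="plans">."""
    soup = BeautifulSoup(markup, "html.parser")
    return [
        p.get_text(separator=sep, strip=True).split(sep)
        for p in soup.select("p.plans")
    ]


def format_table(header, rows):
    """Left-align columns, sizing each to its widest cell.

    Assumes every row has exactly len(header) cells.
    """
    table = [list(header)] + rows
    widths = [max(len(r[i]) for r in table) for i in range(len(header))]
    return "\n".join(
        "  ".join(cell.ljust(w) for cell, w in zip(r, widths)).rstrip()
        for r in table
    )


# Hypothetical sample whose body text contains a '|'
sample = '<p class="plans"><strong>TITLE</strong><br/>BODY | CONTENT<br/>EXTRA</p>'
rows = plan_rows(sample)
table = format_table(("Col1", "Col2", "Col3"), rows)
print(table)
```

Sizing the columns from the data avoids hard-coding a width of 25, which would misalign any cell longer than that.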