Extract text inside HTML paragraph using BeautifulSoup in Python
<p>
<a name="533660373"></a>
<strong>Title: Point of Sale Threats Proliferate</strong><br />
<strong>Severity: Normal Severity</strong><br />
<strong>Published: Thursday, December 04, 2014 20:27</strong><br />
Several new Point of Sale malware families have emerged recently, to include LusyPOS,..<br />
<em>Analysis: Emboldened by past success and media attention, threat actors ..</em>
<br />
</p>
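This is the paragraph I want to extract from an HTML page using BeautifulSoup in Python. I can get the values inside the tags with the .children and .string methods, but I cannot get the text "Several new Point of Sale malware...", which sits directly in the paragraph without any enclosing tag. I tried soup.p.text, .get_text(), and so on, but none of them worked.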
import urllib.request
from bs4 import BeautifulSoup

url = "https://www.geeksforgeeks.org/how-to-automate-an-excel-sheet-in-python/?ref=feed"
html = urllib.request.urlopen(url)
htmlParse = BeautifulSoup(html, 'html.parser')
# Print the full text of every <p> element, including text inside nested tags
for para in htmlParse.find_all("p"):
    print(para.get_text())
Use find_all() with text=True to find all text nodes, and recursive=False to search only the direct children of the parent p tag:
from bs4 import BeautifulSoup
data = """
<p>
<a name="533660373"></a>
<strong>Title: Point of Sale Threats Proliferate</strong><br />
<strong>Severity: Normal Severity</strong><br />
<strong>Published: Thursday, December 04, 2014 20:27</strong><br />
Several new Point of Sale malware families have emerged recently, to include LusyPOS,..<br />
<em>Analysis: Emboldened by past success and media attention, threat actors ..</em>
<br />
</p>
"""
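# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(data, "html.parser")
# Join only the text nodes that are direct children of <p>, skipping text inside nested tags
print(''.join(text.strip() for text in soup.p.find_all(text=True, recursive=False)))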
Prints:
Several new Point of Sale malware families have emerged recently, to include LusyPOS,..
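Since the question mentions .children, the same result can probably also be reached by walking the direct children of the p tag and keeping only the bare strings. The following is a minimal sketch, not the answerer's method: the helper name direct_text is hypothetical, and it assumes the html.parser backend and the sample markup from the question. Note that comments and CDATA sections are also NavigableString subclasses, so markup containing them would need an extra check.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

data = """
<p>
<a name="533660373"></a>
<strong>Title: Point of Sale Threats Proliferate</strong><br />
<strong>Severity: Normal Severity</strong><br />
<strong>Published: Thursday, December 04, 2014 20:27</strong><br />
Several new Point of Sale malware families have emerged recently, to include LusyPOS,..<br />
<em>Analysis: Emboldened by past success and media attention, threat actors ..</em>
<br />
</p>
"""

soup = BeautifulSoup(data, "html.parser")


def direct_text(tag):
    """Hypothetical helper: return text that is a direct child of `tag`,
    ignoring text wrapped in nested elements such as <strong> or <em>."""
    parts = []
    for child in tag.children:
        # Bare text nodes are NavigableString instances; nested elements are Tag objects
        if isinstance(child, NavigableString):
            stripped = child.strip()
            if stripped:  # skip whitespace-only nodes between tags
                parts.append(stripped)
    return " ".join(parts)


print(direct_text(soup.p))
```

Joining with a space rather than an empty string keeps separate text fragments readable if the paragraph contains more than one bare text node.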