Extracting text from an HTML paragraph with BeautifulSoup in Python

Question

<p>
    <a name="533660373"></a>
    <strong>Title: Point of Sale Threats Proliferate</strong><br />
    <strong>Severity: Normal Severity</strong><br />
    <strong>Published: Thursday, December 04, 2014 20:27</strong><br />
    Several new Point of Sale malware families have emerged recently, to include LusyPOS,..<br />
    <em>Analysis: Emboldened by past success and media attention, threat actors  ..</em>
    <br />
</p>

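This is the paragraph I want to extract from an HTML page using BeautifulSoup in Python. I can get the values inside the tags with .children and .string. But I cannot get the text "Several new Point of Sale malware..." that sits directly in the paragraph with no tag around it. I tried soup.p.text, .get_text(), and so on, but none of them worked.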

Answer 1

import urllib.request
from bs4 import BeautifulSoup

url = "https://www.geeksforgeeks.org/how-to-automate-an-excel-sheet-in-python/?ref=feed"

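# Download the page and parse it with Python's built-in HTML parser.
html = urllib.request.urlopen(url)
htmlParse = BeautifulSoup(html, 'html.parser')

# Print the text of every <p> element, including text inside nested tags.
for para in htmlParse.find_all("p"):
    print(para.get_text())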
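
To try the same call without a network request, here is a minimal sketch that runs get_text() on a shortened copy of the paragraph from the question. The variable names and the separator/strip arguments are choices made for this illustration and are not part of the answer above. Note that get_text() returns the text of every descendant, so the Title and Severity lines come back together with the untagged sentence.

```python
from bs4 import BeautifulSoup

# Shortened copy of the paragraph from the question (illustrative input).
html = """<p>
    <a name="533660373"></a>
    <strong>Title: Point of Sale Threats Proliferate</strong><br />
    <strong>Severity: Normal Severity</strong><br />
    Several new Point of Sale malware families have emerged recently, to include LusyPOS,..<br />
    <em>Analysis: Emboldened by past success and media attention, threat actors  ..</em>
</p>"""

soup = BeautifulSoup(html, "html.parser")

# get_text() walks all descendants of <p>; strip=True trims each piece and
# drops whitespace-only pieces, and the separator puts each piece on its own line.
all_text = soup.p.get_text(separator="\n", strip=True)
print(all_text)
```

The output mixes tagged and untagged text, so this approach alone does not isolate the bare sentence.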

Answer 2

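Use find_all() with text=True (named string=True in newer BeautifulSoup releases) to find all text nodes, and recursive=False to search only the direct children of the parent p tag: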

from bs4 import BeautifulSoup

data = """
<p>
    <a name="533660373"></a>
    <strong>Title: Point of Sale Threats Proliferate</strong><br />
    <strong>Severity: Normal Severity</strong><br />
    <strong>Published: Thursday, December 04, 2014 20:27</strong><br />
    Several new Point of Sale malware families have emerged recently, to include LusyPOS,..<br />
    <em>Analysis: Emboldened by past success and media attention, threat actors  ..</em>
    <br />
</p>
"""

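# Name the parser explicitly to avoid the "no parser specified" warning.
soup = BeautifulSoup(data, "html.parser")
# string=True matches text nodes only (older bs4 releases call this argument text=True).
print(''.join(text.strip() for text in soup.p.find_all(string=True, recursive=False)))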

Prints:

Several new Point of Sale malware families have emerged recently, to include LusyPOS,..
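
An equivalent way to get the same result, closer to the .children approach mentioned in the question, is to walk the direct children of the p tag and keep only the NavigableString nodes. This is a sketch under that assumption, not part of the answer above; the helper name direct_text is hypothetical.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

data = """
<p>
    <a name="533660373"></a>
    <strong>Title: Point of Sale Threats Proliferate</strong><br />
    <strong>Severity: Normal Severity</strong><br />
    <strong>Published: Thursday, December 04, 2014 20:27</strong><br />
    Several new Point of Sale malware families have emerged recently, to include LusyPOS,..<br />
    <em>Analysis: Emboldened by past success and media attention, threat actors  ..</em>
    <br />
</p>
"""


def direct_text(tag):
    """Join the stripped text nodes that are direct children of `tag`."""
    parts = []
    for child in tag.children:
        # Tags such as <strong> or <em> are skipped; only bare strings are kept.
        if isinstance(child, NavigableString):
            parts.append(child.strip())
    return "".join(parts)


soup = BeautifulSoup(data, "html.parser")
result = direct_text(soup.p)
print(result)
```

Whitespace-only strings between the tags strip down to empty strings, so only the untagged sentence remains in the joined result.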

Answer 1
2 2021-11-21 09:05:46

Answer 2
1 Accepted 2014-12-24 05:38:42