Extract text inside HTML paragraph using BeautifulSoup in Python

<p>
    <a name="533660373"></a>
    <strong>Title: Point of Sale Threats Proliferate</strong><br />
    <strong>Severity: Normal Severity</strong><br />
    <strong>Published: Thursday, December 04, 2014 20:27</strong><br />
    Several new Point of Sale malware families have emerged recently, to include LusyPOS,..<br />
    <em>Analysis: Emboldened by past success and media attention, threat actors  ..</em>
    <br />
</p>

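This is the paragraph I want to extract from an HTML page using BeautifulSoup in Python. I can get the values inside the tags using .children and .string. However, I cannot get the text "Several new Point of Sale malware..." that sits directly in the paragraph, outside any tag. I have tried soup.p.text, .get_text(), and so on, but none of them gave me just that text.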

import urllib.request
from bs4 import BeautifulSoup

url = "https://www.geeksforgeeks.org/how-to-automate-an-excel-sheet-in-python/?ref=feed"

html = urllib.request.urlopen(url)

htmlParse = BeautifulSoup(html, 'html.parser')

for para in htmlParse.find_all("p"):
    print(para.get_text())
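
For contrast, here is a minimal sketch of what get_text() returns for a shortened copy of the paragraph from the question. The variable names html and all_text are illustrative only. get_text() gathers the text of every descendant, so the untagged sentence comes back mixed with the <strong> and <em> contents, which may be why it did not look like a usable result.

```python
from bs4 import BeautifulSoup

# Shortened copy of the paragraph from the question (illustrative sample).
html = """<p>
    <strong>Title: Point of Sale Threats Proliferate</strong><br />
    Several new Point of Sale malware families have emerged recently, to include LusyPOS,..<br />
    <em>Analysis: Emboldened by past success and media attention, threat actors  ..</em>
</p>"""

soup = BeautifulSoup(html, "html.parser")

# get_text() walks all descendants, so text inside <strong> and <em>
# is included alongside the bare sentence.
all_text = soup.p.get_text(" ", strip=True)
print(all_text)
```

The bare sentence is present in the result, but it is not isolated from the tagged text around it.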

Use find_all() with text=True to find all text nodes, and pass recursive=False to search only the direct children of the parent p tag:

from bs4 import BeautifulSoup

data = """
<p>
    <a name="533660373"></a>
    <strong>Title: Point of Sale Threats Proliferate</strong><br />
    <strong>Severity: Normal Severity</strong><br />
    <strong>Published: Thursday, December 04, 2014 20:27</strong><br />
    Several new Point of Sale malware families have emerged recently, to include LusyPOS,..<br />
    <em>Analysis: Emboldened by past success and media attention, threat actors  ..</em>
    <br />
</p>
"""

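# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(data, "html.parser")

# text=True matches text nodes only; recursive=False limits the search to direct children of <p>.
# Newer Beautiful Soup releases prefer the spelling string=True for the same filter.
print(''.join(text.strip() for text in soup.p.find_all(text=True, recursive=False)))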

Prints:

Several new Point of Sale malware families have emerged recently, to include LusyPOS,..
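
The same idea can be sketched without find_all(), by walking the paragraph's direct children and keeping only the bare strings. This is an alternative illustration, not part of the original answer. The helper name direct_text is hypothetical, and it assumes the paragraph contains no HTML comments (in bs4, comments are also NavigableString instances).

```python
from bs4 import BeautifulSoup, NavigableString

html = """
<p>
    <a name="533660373"></a>
    <strong>Title: Point of Sale Threats Proliferate</strong><br />
    <strong>Severity: Normal Severity</strong><br />
    <strong>Published: Thursday, December 04, 2014 20:27</strong><br />
    Several new Point of Sale malware families have emerged recently, to include LusyPOS,..<br />
    <em>Analysis: Emboldened by past success and media attention, threat actors  ..</em>
    <br />
</p>
"""

def direct_text(tag):
    """Join the non-empty text nodes that are direct children of tag (hypothetical helper)."""
    parts = []
    for child in tag.children:
        # Strings sitting directly in the tag are NavigableString objects;
        # nested elements such as <strong> or <em> are Tag objects and are skipped.
        if isinstance(child, NavigableString):
            stripped = child.strip()
            if stripped:
                parts.append(stripped)
    return " ".join(parts)

soup = BeautifulSoup(html, "html.parser")
bare_text = direct_text(soup.p)
print(bare_text)
```

Joining with a space rather than an empty string keeps words apart if the paragraph holds more than one bare text fragment.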
