Using BeautifulSoup to find an HTML tag that contains certain text

Question

I'm trying to get the elements in an HTML document that contain the following text pattern: #\S{11}

<h2> this is cool #12345678901 </h2>

So the example above would be matched with:

soup('h2',text=re.compile(r' #\S{11}'))

The result would be something like:

[u'blahblah #223409823523', u'thisisinteresting #293845023984']

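I'm able to get all the text that matches (see the line above). But I want the parent element of the matching text, so that I can use it as a starting point for traversing the document tree. In this case, I'd want all the h2 elements returned, not the text matches.

Any ideas?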

Answer 1

from BeautifulSoup import BeautifulSoup
import re

html_text = """
<h2>this is cool #12345678901</h2>
<h2>this is nothing</h2>
<h1>foo #126666678901</h1>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
"""

soup = BeautifulSoup(html_text)


for elem in soup(text=re.compile(r' #\S{11}')):
    print elem.parent

Prints:

<h2>this is cool #12345678901</h2>
<h1>foo #126666678901</h1>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
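
For readers on Python 3 and Beautiful Soup 4, the same parent lookup can be sketched roughly as follows. This is an illustration rather than part of the original answer: it assumes the `bs4` package with the built-in `html.parser`, and it uses the `string=` argument, which newer bs4 releases use as the name for `text=`. The `h2_parents` filter is an addition, needed because a text-only search is not limited to one tag name.

```python
import re
from bs4 import BeautifulSoup

html_text = """
<h2>this is cool #12345678901</h2>
<h2>this is nothing</h2>
<h1>foo #126666678901</h1>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
"""

soup = BeautifulSoup(html_text, "html.parser")
pattern = re.compile(r" #\S{11}")

# A text-only search yields the matching text nodes; .parent is the enclosing tag.
parents = [node.parent for node in soup.find_all(string=pattern)]

# Keep only the <h2> parents, because the <h1> text matches the pattern as well.
h2_parents = [tag for tag in parents if tag.name == "h2"]

for tag in h2_parents:
    print(tag)
```

Each item in `h2_parents` is a tag object, so it can serve as the starting point for further traversal.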

Answer 2

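When text= is used as a search criterion, BeautifulSoup search operations return [a list of] BeautifulSoup.NavigableString objects, rather than BeautifulSoup.Tag objects as in other cases. Check the object's __dict__ to see which attributes are available to you. Of these attributes, parent is preferred over previous because of changes in BS4.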

from BeautifulSoup import BeautifulSoup
from pprint import pprint
import re

html_text = """
<h2>this is cool #12345678901</h2>
<h2>this is nothing</h2>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
"""

soup = BeautifulSoup(html_text)

# Even though the OP was not looking for 'cool', it's more understandable to work with item zero.
pattern = re.compile(r'cool')

pprint(soup.find(text=pattern).__dict__)
#>> {'next': u'\n',
#>>  'nextSibling': None,
#>>  'parent': <h2>this is cool #12345678901</h2>,
#>>  'previous': <h2>this is cool #12345678901</h2>,
#>>  'previousSibling': None}

print soup.find('h2')
#>> <h2>this is cool #12345678901</h2>
print soup.find('h2', text=pattern)
#>> this is cool #12345678901
print soup.find('h2', text=pattern).parent
#>> <h2>this is cool #12345678901</h2>
print soup.find('h2', text=pattern) == soup.find('h2')
#>> False
print soup.find('h2', text=pattern) == soup.find('h2').text
#>> True
print soup.find('h2', text=pattern).parent == soup.find('h2')
#>> True
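
The distinction between the text node and its enclosing tag can also be checked under Beautiful Soup 4. The sketch below is an illustration, not part of the original answer; it assumes the `bs4` package, the built-in `html.parser`, and the `string=` argument for text-only searches.

```python
import re
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

html_text = """
<h2>this is cool #12345678901</h2>
<h2>this is nothing</h2>
"""

soup = BeautifulSoup(html_text, "html.parser")
pattern = re.compile(r"cool")

text_node = soup.find(string=pattern)  # search by text only: a text node
first_h2 = soup.find("h2")             # search by tag name only: a tag

# The text node is a NavigableString; its parent is the Tag that contains it.
print(isinstance(text_node, NavigableString))
print(isinstance(text_node.parent, Tag))
print(text_node.parent is first_h2)
```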

Answer 3

With bs4 (Beautiful Soup 4), the OP's attempt works exactly as expected:

import re
from bs4 import BeautifulSoup

soup = BeautifulSoup("<h2> this is cool #12345678901 </h2>", "html.parser")
soup('h2',text=re.compile(r' #\S{11}'))

This returns [<h2> this is cool #12345678901 </h2>].
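
Since the question asks for a starting point for traversing the tree, here is a rough sketch of using the matched h2 tags that way under bs4. The sample markup with `<p>` elements is made up for illustration, and the code assumes the `string=` argument and the built-in `html.parser`.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: each heading is followed by a paragraph.
html_text = """
<h2>this is cool #12345678901</h2>
<p>first paragraph</p>
<h2>this is nothing</h2>
<p>second paragraph</p>
"""

soup = BeautifulSoup(html_text, "html.parser")
pattern = re.compile(r" #\S{11}")

# With a tag name plus string=, bs4 returns the tags themselves, not the text nodes.
headings = soup.find_all("h2", string=pattern)

# Traverse from each matched heading to the paragraph that follows it.
following = [h2.find_next_sibling("p").get_text() for h2 in headings]

print(headings)
print(following)
```

One caveat: this form compares the pattern against each tag's `.string`, so an h2 that contains nested tags may not match; in that case the text-node search followed by `.parent` is the more forgiving route.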

Answer 1: score 69, accepted, 2009-05-14 21:53:21
Answer 2: score 19
Answer 3: score 2, 2018-01-20 20:17:07
