Using BeautifulSoup to find a HTML tag that contains certain text
I'm trying to get the elements in an HTML doc that contain the following pattern of text: #\S{11}
<h2> this is cool #12345678901 </h2>
So, the previous would match by using:
soup('h2',text=re.compile(r' #\S{11}'))
and the results would be something like:
[u'blahblah #223409823523', u'thisisinteresting #293845023984']
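I'm able to get all the text that matches (see the line above). But I want the parent element of the text to match, so I can use that as a starting point for traversing the document tree. In this case, I'd want all the h2 elements to be returned, not the text matches.
Ideas?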
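As a side check on the pattern itself, here is a minimal sketch that uses only the standard re module. The sample strings are hypothetical, modeled on the h2 contents in this thread. It illustrates that a search with ' #\S{11}' succeeds whenever a space, a '#', and at least 11 non-whitespace characters appear, so longer tokens match as well.

```python
import re

# Pattern from the question: a space, '#', then 11 non-whitespace characters.
pattern = re.compile(r' #\S{11}')

# Hypothetical sample strings, modeled on the h2 contents in this thread.
samples = [
    "this is cool #12345678901",          # 11 characters after '#'
    "this is interesting #126666678901",  # 12 characters: search matches the first 11
    "this is nothing",                    # no '#...' token at all
    "too short #12345",                   # only 5 characters after '#'
]

# re.search looks for the pattern anywhere in the string.
matched = [s for s in samples if pattern.search(s)]
print(matched)
```

If exactly 11 characters are required, the pattern would need an explicit boundary after the token; that is a design choice the question leaves open.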
from BeautifulSoup import BeautifulSoup
import re
html_text = """
<h2>this is cool #12345678901</h2>
<h2>this is nothing</h2>
<h1>foo #126666678901</h1>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
"""
soup = BeautifulSoup(html_text)
# Searching with text= yields the matching strings; .parent is the enclosing tag.
for elem in soup(text=re.compile(r' #\S{11}')):
    print elem.parent
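Prints (the h1 is included as well, because the search is not restricted to h2):
<h2>this is cool #12345678901</h2>
<h1>foo #126666678901</h1>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>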
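For readers on Python 3 and Beautiful Soup 4, the same idea can be sketched roughly as follows. This assumes the beautifulsoup4 package is installed and uses the built-in html.parser. The filter on the parent's tag name is an addition that is not part of the answer above: it restricts the results to h2, since an unrestricted text search also matches text inside other tags such as the h1.

```python
import re
from bs4 import BeautifulSoup  # assumes the beautifulsoup4 package is installed

html_text = """
<h2>this is cool #12345678901</h2>
<h2>this is nothing</h2>
<h1>foo #126666678901</h1>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
"""

soup = BeautifulSoup(html_text, "html.parser")

# find_all(string=...) returns the matching text nodes (NavigableString objects);
# .parent steps up to the enclosing tag. The name check keeps only h2 parents.
parents = [
    node.parent
    for node in soup.find_all(string=re.compile(r' #\S{11}'))
    if node.parent.name == "h2"
]

for tag in parents:
    print(tag)
```

Each item in parents is a Tag, so it can be used directly as a starting point for further traversal.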
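When text= is used as a criterion, BeautifulSoup search operations deliver [a list of] BeautifulSoup.NavigableString objects rather than BeautifulSoup.Tag objects. Check the object's __dict__ to see the attributes made available to you. Of these attributes, parent is favored over previous because of changes in BS4.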
from BeautifulSoup import BeautifulSoup
from pprint import pprint
import re
html_text = """
<h2>this is cool #12345678901</h2>
<h2>this is nothing</h2>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
"""
soup = BeautifulSoup(html_text)
# Even though the OP was not looking for 'cool', it's more understandable to work with item zero.
pattern = re.compile(r'cool')
pprint(soup.find(text=pattern).__dict__)
#>> {'next': u'\n',
#>> 'nextSibling': None,
#>> 'parent': <h2>this is cool #12345678901</h2>,
#>> 'previous': <h2>this is cool #12345678901</h2>,
#>> 'previousSibling': None}
print soup.find('h2')
#>> <h2>this is cool #12345678901</h2>
print soup.find('h2', text=pattern)
#>> this is cool #12345678901
print soup.find('h2', text=pattern).parent
#>> <h2>this is cool #12345678901</h2>
print soup.find('h2', text=pattern) == soup.find('h2')
#>> False
print soup.find('h2', text=pattern) == soup.find('h2').text
#>> True
print soup.find('h2', text=pattern).parent == soup.find('h2')
#>> True
With bs4 (Beautiful Soup 4), the OP's attempt works exactly as expected:
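import re
from bs4 import BeautifulSoup
# html.parser is the built-in parser; naming one avoids bs4's "no parser specified" warning.
soup = BeautifulSoup("<h2> this is cool #12345678901 </h2>", "html.parser")
soup('h2',text=re.compile(r' #\S{11}'))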
returns [<h2> this is cool #12345678901 </h2>].
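In Beautiful Soup 4.4.0 and later, the same argument is also available under the name string. The following is a minimal sketch under that assumption, using the built-in html.parser; the second h2 is a hypothetical non-matching element added for contrast.

```python
import re
from bs4 import BeautifulSoup  # assumes beautifulsoup4 >= 4.4.0

# The second h2 is a hypothetical non-matching element added for contrast.
soup = BeautifulSoup(
    "<h2> this is cool #12345678901 </h2><h2>no match</h2>",
    "html.parser",
)

# With a tag name plus string=, bs4 returns the Tag objects whose .string matches.
matches = soup.find_all("h2", string=re.compile(r" #\S{11}"))
print(matches)

# Each result is a Tag, so it can serve as a starting point for traversal.
first = matches[0]
print(first.name, first.get_text(strip=True))
```

Because the results are already tags, no extra step through .parent is needed in this variant.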