How to find HTML tags with specific text? - Beautiful Soup

Question

Source:

<span class="new"> <a class="blog" href="http://whatever1.com" rel="nofollow">whatever1</a> do something at <a class="others" href="http://example1.com" rel="nofollow">example1</a></span>

<span class="new"> <a class="blog" href="http://whatever2.com" rel="nofollow">whatever2</a> do other things at <a class="others" href="http://example2.com" rel="nofollow">example2</a></span>

<span class="new"> <a class="blog" href="http://whatever3.com" rel="nofollow">whatever3</a> do something at <a class="others" href="http://example3.com" rel="nofollow">example3</a></span>

I want to find every <span class="new"> whose text contains "do something at". Here is my code; I just don't know why it doesn't work:

soup = bs4.BeautifulSoup(html, "lxml")
all_tags = soup.findAll(name = "span", attrs = {"class": "new"}, text = re.compile('do something.*'))

Nothing is found. If I remove text = re.compile('.*do something.*'), all of the tags above are found. I know something must be wrong with my regex pattern, so what is the correct form?

Answer 1

You can always try a hybrid approach:

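import re
import bs4

soup = bs4.BeautifulSoup(html, "lxml")  # html holds the source shown in the question
spans = soup.find_all("span", attrs={"class": "new"})
regex = re.compile("do something at")
# search() matches anywhere in the text, so no leading or trailing .* is needed
desired_tags = [span for span in spans if regex.search(span.text)]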
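
Why the original call finds nothing, as far as I understand Beautiful Soup's behavior: when text= (called string= in newer releases) is combined with a tag name, the pattern is tested against each tag's .string, and .string is None for a tag with more than one child. Every span here holds two links plus loose text, so the regex has nothing to match. Below is a minimal sketch that moves the test into a filter function and checks get_text() instead. The helper name spans_with_text and the use of html.parser are my own choices for illustration, not part of the question:

```python
import re
from bs4 import BeautifulSoup

HTML = """
<span class="new"> <a class="blog" href="http://whatever1.com" rel="nofollow">whatever1</a> do something at <a class="others" href="http://example1.com" rel="nofollow">example1</a></span>
<span class="new"> <a class="blog" href="http://whatever2.com" rel="nofollow">whatever2</a> do other things at <a class="others" href="http://example2.com" rel="nofollow">example2</a></span>
<span class="new"> <a class="blog" href="http://whatever3.com" rel="nofollow">whatever3</a> do something at <a class="others" href="http://example3.com" rel="nofollow">example3</a></span>
"""


def spans_with_text(html, pattern, css_class="new"):
    """Return <span> tags of the given class whose combined text matches pattern."""
    soup = BeautifulSoup(html, "html.parser")
    regex = re.compile(pattern)

    def matches(tag):
        # get_text() joins the text of all descendants, unlike .string
        return (
            tag.name == "span"
            and css_class in tag.get("class", [])
            and regex.search(tag.get_text()) is not None
        )

    return soup.find_all(matches)


soup = BeautifulSoup(HTML, "html.parser")
print(soup.span.string)  # None, because the span has several children

found = spans_with_text(HTML, r"do something at")
print([span.a.get_text() for span in found])  # ['whatever1', 'whatever3']
```

Passing a function to find_all keeps everything in one call; the list comprehension above does the same job in two steps.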

Answer 2

Iterate over the contents of the HTML file and print the matching lines. Here I have replaced the file contents with the list l:

>>> import re
>>> l = ['<span class="new"> <a class="blog" href="http://whatever1.com" rel="nofollow">whatever1</a> do something at <a class="others" href="http://example1.com" rel="nofollow">example1</a></span>',
...      '<span class="new"> <a class="blog" href="http://whatever2.com" rel="nofollow">whatever2</a> do other things at <a class="others" href="http://example2.com" rel="nofollow">example2</a></span>',
...      '<span class="new"> <a class="blog" href="http://whatever3.com" rel="nofollow">whatever3</a> do something at <a class="others" href="http://example3.com" rel="nofollow">example3</a></span>']
>>> for line in l:
...     if re.search('<span class="new">.*do something.*', line):
...         print(line)
...
<span class="new"> <a class="blog" href="http://whatever1.com" rel="nofollow">whatever1</a> do something at <a class="others" href="http://example1.com" rel="nofollow">example1</a></span>
<span class="new"> <a class="blog" href="http://whatever3.com" rel="nofollow">whatever3</a> do something at <a class="others" href="http://example3.com" rel="nofollow">example3</a></span>
>>>

Answer 3

This is how I usually look for text.

spans = soup.find_all("span", attrs={"class": "new"})
for s in spans:
    if "do something" in str(s):
        print(s)  # e.g. print each matching span
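
One caveat with the substring test: str(s) is the tag's full markup, so it also sees attribute values and nested tags, not just the visible text. A small sketch of the difference, using a made-up snippet (the URL and text are hypothetical, and html.parser is just a convenient parser choice):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the phrase appears only inside an href, not in the visible text
html = '<span class="new"><a href="http://example.com/do something">link</a> unrelated text</span>'
span = BeautifulSoup(html, "html.parser").span

in_markup = "do something" in str(span)       # checks tags and attributes too
in_text = "do something" in span.get_text()   # checks visible text only

print(in_markup, in_text)  # True False
```

If only the visible text matters, testing against get_text() avoids this kind of false positive.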

3 solutions

Solution 1
1 Accepted 2012-10-25 01:51:24

Solution 2
0 2012-10-25 01:50:17

Solution 3
0 2012-10-26 04:34:31
