BeautifulSoup find_all() Doesn't Find All Requested Elements
I'm seeing some odd behavior in BeautifulSoup, as shown in the following example.
import re
from bs4 import BeautifulSoup
html = """<p style='color: red;'>This has a <b>color</b> of red. Because it likes the color red</p>
<p class='blue'>This paragraph has a color of blue.</p>
<p>This paragraph does not have a color.</p>"""
soup = BeautifulSoup(html, 'html.parser')
pattern = re.compile('color', flags=re.UNICODE+re.IGNORECASE)
paras = soup.find_all('p', string=pattern)
print(len(paras)) # expected to find 3 paragraphs with word "color" in it
2
print(paras[0].prettify())
<p class="blue">
This paragraph has a color of blue.
</p>
print(paras[1].prettify())
<p>
This paragraph does not have a color.
</p>
As you can see, for some reason the first paragraph, <p style='color: red;'>This has a <b>color</b> of red. Because it likes the color red</p>, is not picked up by find_all(...), and I can't figure out why.
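The string argument expects the tag to contain only text, with no nested tags. If you print .string for the first p tag, it returns None, because that tag contains another tag (the <b> element).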
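Or, to explain it better, the documentation says:
"If a tag has only one child, and that child is a NavigableString, the child is made available as .string."
"If a tag contains more than one thing, then it's not clear what .string should refer to, so .string is defined to be None."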
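To make the quoted rule concrete, here is a minimal sketch that prints the number of direct children and the .string value for each paragraph from the question. The variable names child_counts and string_values are only illustrative; the HTML is the sample from the question.

```python
from bs4 import BeautifulSoup

html = """<p style='color: red;'>This has a <b>color</b> of red. Because it likes the color red</p>
<p class='blue'>This paragraph has a color of blue.</p>
<p>This paragraph does not have a color.</p>"""
soup = BeautifulSoup(html, 'html.parser')

paragraphs = soup.find_all('p')

# Number of direct children of each <p>: text nodes and nested tags both count.
child_counts = [len(p.contents) for p in paragraphs]

# .string is only set when a tag has exactly one child that is a NavigableString.
string_values = [p.string for p in paragraphs]

for count, value in zip(child_counts, string_values):
    print(count, value)
```

The first paragraph has three children (text, the <b> tag, and more text), so its .string is None and a string=pattern filter has nothing to match against. The other two paragraphs each have a single text child, which is why only they were returned.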
The way around this is to pass find_all() a lambda function that checks tag.text instead.
html = """<p style='color: red;'>This has a <b>color</b> of red. Because it likes the color red</p>
<p class='blue'>This paragraph has a color of blue.</p>
<p>This paragraph does not have a color.</p>"""
soup = BeautifulSoup(html, 'html.parser')
first_p = soup.find('p')
print(first_p)
# <p style="color: red;">This has a <b>color</b> of red. Because it likes the color red</p>
print(first_p.string)
# None
print(first_p.text)
# This has a color of red. Because it likes the color red
paras = soup.find_all(lambda tag: tag.name == 'p' and 'color' in tag.text.lower())
print(paras)
# [<p style="color: red;">This has a <b>color</b> of red. Because it likes the color red</p>, <p class="blue">This paragraph has a color of blue.</p>, <p>This paragraph does not have a color.</p>]
If you want to grab the 'p' tags, you can do the following:
import re
from bs4 import BeautifulSoup
html = """<p style='color: red;'>This has a <b>color</b> of red. Because it likes the color red</p>
<p class='blue'>This paragraph has a color of blue.</p>
<p>This paragraph does not have a color.</p>"""
soup = BeautifulSoup(html, 'html.parser')
paras = soup.find_all('p')
for p in paras:
    print(p.get_text())
I haven't actually figured out why specifying the string argument of find_all(...) (or text, in older versions of BeautifulSoup) doesn't give me what I want, but the following gave me a general solution.
pattern = re.compile('color', flags=re.UNICODE+re.IGNORECASE)
desired_tags = [tag for tag in soup.find_all('p') if pattern.search(tag.text) is not None]
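The same list comprehension can be wrapped in a small reusable helper. This is only a sketch: find_tags_with_text is a hypothetical name and not part of BeautifulSoup, and it assumes that matching against the tag's full text, including the text of nested tags, is what is wanted.

```python
import re
from bs4 import BeautifulSoup


def find_tags_with_text(soup, name, pattern):
    """Return tags called `name` whose full text matches the compiled `pattern`.

    Hypothetical helper, not a BeautifulSoup API. get_text() joins the text of
    the tag and all of its descendants, so nested tags such as <b> do not
    prevent a match.
    """
    return [tag for tag in soup.find_all(name) if pattern.search(tag.get_text())]


html = """<p style='color: red;'>This has a <b>color</b> of red. Because it likes the color red</p>
<p class='blue'>This paragraph has a color of blue.</p>
<p>This paragraph does not have a color.</p>"""
soup = BeautifulSoup(html, 'html.parser')

pattern = re.compile('color', flags=re.IGNORECASE)
matches = find_tags_with_text(soup, 'p', pattern)
print(len(matches))  # 3
```

Because the pattern is applied to the text only, the first paragraph matches on its visible words and not on its style attribute.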