BeautifulSoup find_all() does not find all requested elements

Question

I'm seeing some odd behavior in BeautifulSoup, as shown in the example below.

import re
from bs4 import BeautifulSoup
html = """<p style='color: red;'>This has a <b>color</b> of red. Because it likes the color red</p>
<p class='blue'>This paragraph has a color of blue.</p>
<p>This paragraph does not have a color.</p>"""
soup = BeautifulSoup(html, 'html.parser')
pattern = re.compile('color', flags=re.UNICODE+re.IGNORECASE)
paras = soup.find_all('p', string=pattern)
print(len(paras))  # expected 3 paragraphs containing the word "color"
# 2
print(paras[0].prettify())
# <p class="blue">
#   This paragraph has a color of blue.
# </p>
print(paras[1].prettify())
# <p>
#   This paragraph does not have a color.
# </p>

As you can see, for some reason the first paragraph ("This has a color of red. Because it likes the color red") is not picked up by find_all(...), and I can't figure out why.

Answer 1

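The string argument is matched against a tag's .string, so it expects the tag to contain only text and no child tags. If you print .string for the first p tag, it returns None, because that paragraph contains a <b> tag.

Or, to explain it better, the documentation says:

If a tag has only one child, and that child is a NavigableString, the child is made available as .string

If a tag contains more than one thing, then it's not clear what .string should refer to, so .string is defined to be None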
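
To make the quoted rule concrete, here is a minimal sketch that reads .string from a few small fragments. The fragments and the helper name string_of are made up for illustration; only BeautifulSoup, .string and the 'html.parser' backend come from the posts in this thread.

```python
from bs4 import BeautifulSoup

def string_of(fragment):
    """Return .string of the first <p> in an HTML fragment (hypothetical helper)."""
    soup = BeautifulSoup(fragment, 'html.parser')
    return soup.p.string

# One child, and it is a NavigableString: .string is that text.
only_text = string_of("<p>plain text</p>")

# Text mixed with a <b> tag: more than one child, so .string is None.
mixed = string_of("<p>some <b>bold</b> text</p>")

# A single child tag that itself has a .string: the parent reports the same string.
nested = string_of("<p><b>bold only</b></p>")

print(only_text, mixed, nested)
```

The second case is the situation in the question: the first paragraph mixes text with a <b> tag, so there is no single string for the pattern to be tested against.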

One way to work around this is to pass a lambda function to find_all().

from bs4 import BeautifulSoup

html = """<p style='color: red;'>This has a <b>color</b> of red. Because it likes the color red</p>
<p class='blue'>This paragraph has a color of blue.</p>
<p>This paragraph does not have a color.</p>"""
soup = BeautifulSoup(html, 'html.parser')

first_p = soup.find('p')
print(first_p)
# <p style="color: red;">This has a <b>color</b> of red. Because it likes the color red</p>
print(first_p.string)
# None
print(first_p.text)
# This has a color of red. Because it likes the color red

paras = soup.find_all(lambda tag: tag.name == 'p' and 'color' in tag.text.lower())
print(paras)
# [<p style="color: red;">This has a <b>color</b> of red. Because it likes the color red</p>, <p class="blue">This paragraph has a color of blue.</p>, <p>This paragraph does not have a color.</p>]

Answer 2

If you just want to grab the 'p' tags and their text, you can do the following:

import re
from bs4 import BeautifulSoup
html = """<p style='color: red;'>This has a <b>color</b> of red. Because it likes the color red</p>
<p class='blue'>This paragraph has a color of blue.</p>
<p>This paragraph does not have a color.</p>"""
soup = BeautifulSoup(html, 'html.parser')

paras = soup.find_all('p')
for p in paras:
    print(p.get_text())

Answer 3

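I still haven't actually figured out why specifying the string argument of find_all(...) (or text, in older versions of BeautifulSoup) doesn't give me what I want, but the following gives me a general solution.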

pattern = re.compile('color', flags=re.UNICODE+re.IGNORECASE)
desired_tags = [tag for tag in soup.find_all('p') if pattern.search(tag.text) is not None]
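
The same idea can also be packaged as a reusable filter, since find_all() accepts a function that takes a tag and returns True or False. This is only a rough sketch: tag_text_matches is a hypothetical helper name and not part of BeautifulSoup, and the HTML is the sample from the question.

```python
import re
from bs4 import BeautifulSoup

def tag_text_matches(name, pattern):
    """Build a find_all() filter for tags called `name` whose full text matches `pattern`.

    Hypothetical helper, not part of BeautifulSoup.
    """
    def _filter(tag):
        # get_text() joins the text of all descendants, so a nested <b> does not hide a match.
        return tag.name == name and pattern.search(tag.get_text()) is not None
    return _filter

html = """<p style='color: red;'>This has a <b>color</b> of red. Because it likes the color red</p>
<p class='blue'>This paragraph has a color of blue.</p>
<p>This paragraph does not have a color.</p>"""
soup = BeautifulSoup(html, 'html.parser')

pattern = re.compile('color', flags=re.IGNORECASE)
paras = soup.find_all(tag_text_matches('p', pattern))
print(len(paras))
```

Checking tag.name inside the filter matters here: without it, the <b> element in the first paragraph would match as well, because its own text also contains "color".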

Answer 1
2 2018-03-18 04:55:13

Answer 2
0 2018-03-17 15:54:28

Answer 3
0 2018-03-17 19:07:20
