Trying to find all <a> elements without a specific class

Question

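This is my first attempt at web scraping, and I am using BeautifulSoup to collect information from a website. I am trying to get all elements that have one class but not another. For example: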

from bs4 import BeautifulSoup

html = """
<a class="something">Information I want</a>
<a class="something somethingelse">Information I don't want</a>
"""

soup = BeautifulSoup(html, "html.parser")

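In this example, I want to get all elements with the class something. However, when I search for all elements containing that class, I also get the elements that contain the class somethingelse, which I don't want.

The code I am using to get them is: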

results = soup.find_all("a", {"class": "something"})

Any help is appreciated. Thanks.

Answer 1

This will work:

from bs4 import BeautifulSoup
text = '''<a class="something">Information I want</a>
<a class="something somethingelse">Information I don't want</a>'''

soup = BeautifulSoup(text, 'html.parser')
r1 = soup.find_all("a", {"class": "something"})
r2 = soup.find_all("a", {"class": "somethingelse"})
for item in r2:
    if item in r1:
        r1.remove(item)
print(r1)

Output

[<a class="something">Information I want</a>]

To extract the text inside the tags, just add the following lines:

for item in r1:
    print(item.text)

Output

Information I want
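
A more compact variant of the same idea, as a minimal sketch: instead of running two searches and removing the overlap, filter a single result list by inspecting each tag's class list. The variable name `wanted` is only illustrative.

```python
from bs4 import BeautifulSoup

text = '''<a class="something">Information I want</a>
<a class="something somethingelse">Information I don't want</a>'''

soup = BeautifulSoup(text, "html.parser")

# Keep tags that have "something" but do not also have "somethingelse".
# tag.get("class", []) returns the class attribute as a list of strings.
wanted = [
    tag
    for tag in soup.find_all("a", class_="something")
    if "somethingelse" not in tag.get("class", [])
]

print([tag.text for tag in wanted])
```

This avoids mutating a result list while comparing tags, and it keeps any element that has other, unrelated classes alongside something.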

Answer 2

For this task, you can find the elements with a lambda function, for example:

from bs4 import BeautifulSoup


html_doc = """<a class="something">Information I want</a>
<a class="something somethingelse">Information I don't want</a>
"""

soup = BeautifulSoup(html_doc, "html.parser")

a = soup.find(
    lambda tag: tag.name == "a" and tag.get("class", []) == ["something"]
)
print(a)

Prints:

<a class="something">Information I want</a>
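
Note that `find` stops at the first match. If several elements may qualify, the same lambda can be passed to `find_all`. The sketch below adds a third, hypothetical anchor to the sample HTML to show this.

```python
from bs4 import BeautifulSoup

# The third <a> is a hypothetical addition, not part of the original example.
html_doc = """<a class="something">Information I want</a>
<a class="something somethingelse">Information I don't want</a>
<a class="something">More information I want</a>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Match <a> tags whose class list is exactly ["something"].
matches = soup.find_all(
    lambda tag: tag.name == "a" and tag.get("class", []) == ["something"]
)
print([m.text for m in matches])
```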

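Or: specify "class" as a list (this gives the right result here because find returns only the first match; a list filter matches a tag that has any of the listed classes):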

a = soup.find("a", {"class": ["something"]})
print(a)

Prints:

<a class="something">Information I want</a>
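
To see how the list form behaves when more than one element is requested, here is a quick check with `find_all`. As far as I understand the BeautifulSoup matching rules, a list of class values matches a tag that carries any of them, so this form should not be relied on to exclude multi-class tags.

```python
from bs4 import BeautifulSoup

html_doc = """<a class="something">Information I want</a>
<a class="something somethingelse">Information I don't want</a>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# A list of class values matches tags that carry ANY of the listed classes,
# so the second <a> (which also has "something") is returned as well.
both = soup.find_all("a", {"class": ["something"]})
print(len(both))
```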

Edit:

To filter for type-icon type-X:

from bs4 import BeautifulSoup


html_doc = """
<a class="type-icon type-1">Information I want 1</a>
<a class="type-icon type-1 type-cell type-abbr">Information I don't want</a>
<a class="type-icon type-2">Information I want 2</a>
<a class="type-icon type-2 type-cell type-abbr">Information I don't want</a>
"""

soup = BeautifulSoup(html_doc, "html.parser")


my_types = ["type-icon", "type-1", "type-2"]


def my_filter(tag):
    if tag.name != "a":
        return False
    c = tag.get("class", [])
    return "type-icon" in c and not set(c).difference(my_types)


a = soup.find_all(my_filter)
print(a)

Prints:

[<a class="type-icon type-1">Information I want 1</a>, <a class="type-icon type-2">Information I want 2</a>]

Or extract the tags you don't want first:

soup = BeautifulSoup(html_doc, "html.parser")

# extract tags I don't want:
for t in soup.select(".type-cell.type-abbr"):
    t.extract()

print(soup.select(".type-icon.type-1, .type-icon.type-2"))

Prints:

[<a class="type-icon type-1">Information I want 1</a>, <a class="type-icon type-2">Information I want 2</a>]
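
Another option, sketched here under the assumption that a reasonably recent BeautifulSoup is installed (4.7 or later, where `select` is backed by the soupsieve selector engine): the CSS `:not()` pseudo-class can express "has this class but not that one" directly, without removing anything from the tree.

```python
from bs4 import BeautifulSoup

html_doc = """
<a class="something">Information I want</a>
<a class="something somethingelse">Information I don't want</a>
<a class="type-icon type-1">Information I want 1</a>
<a class="type-icon type-1 type-cell type-abbr">Information I don't want</a>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# <a> tags with class "something" that lack "somethingelse".
first = soup.select("a.something:not(.somethingelse)")

# <a> tags with class "type-icon" that lack "type-cell".
second = soup.select("a.type-icon:not(.type-cell)")

print([t.text for t in first])
print([t.text for t in second])
```

Unlike the `extract()` approach, this leaves the parsed document unchanged, which matters if the excluded tags are needed later.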

Solution 1
2 Accepted 2021-04-04 14:30:34

Solution 2
1 2021-04-04 14:22:05
