
Trying to find all <a> elements without a *specific* class

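This is my first attempt at web scraping, and I'm using BeautifulSoup to collect information from a website. I'm trying to get all elements that have one class but not another. For example: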

from bs4 import BeautifulSoup

html = """
<a class="something">Information I want</a>
<a class="something somethingelse">Information I don't want</a>
"""

soup = BeautifulSoup(html, "html.parser")

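In this example, I want to get all elements with the something class. However, when I find all elements containing that class, I also get the elements that contain the somethingelse class, which I don't want.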

The code I'm using to get them is:

results = soup.find_all("a", {"class": "something"})

Any help is appreciated. Thank you.

This will work:

from bs4 import BeautifulSoup
text = '''<a class="something">Information I want</a>
<a class="something somethingelse">Information I don't want</a>'''

soup = BeautifulSoup(text, 'html.parser')
r1 = soup.find_all("a", {"class": "something"})
r2 = soup.find_all("a", {"class": "somethingelse"})
for item in r2:
    if item in r1:
        r1.remove(item)
print(r1)

Output

[<a class="something">Information I want</a>]

To extract the text inside the tags, just add the following lines:

for item in r1:
    print(item.text)

Output

Information I want
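The same result can probably be reached in a single pass, without building a second result list and removing items from the first. The sketch below assumes the same two-link sample HTML as above; the names `wanted` and `texts` are illustrative only.

```python
from bs4 import BeautifulSoup

html = """<a class="something">Information I want</a>
<a class="something somethingelse">Information I don't want</a>"""

soup = BeautifulSoup(html, "html.parser")

# Keep links that have "something" but not "somethingelse".
# tag.get("class", []) returns the class attribute as a list of names.
wanted = [
    a
    for a in soup.find_all("a", class_="something")
    if "somethingelse" not in a.get("class", [])
]

texts = [a.get_text() for a in wanted]
print(texts)
```

Because the exclusion is an ordinary Python condition, further unwanted classes can be added to it without another find_all() call.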

For this task, you can find elements with a lambda function, for example:

from bs4 import BeautifulSoup


html_doc = """<a class="something">Information I want</a>
<a class="something somethingelse">Information I don't want</a>
"""

soup = BeautifulSoup(html_doc, "html.parser")

a = soup.find(
    lambda tag: tag.name == "a" and tag.get("class", []) == ["something"]
)
print(a)

Prints:

<a class="something">Information I want</a>

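Or pass the class as a list. Note that this returns the wanted element here only because find() stops at the first match: a list filter matches any element that carries one of the listed classes, so find_all() with the same filter would return both links.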

a = soup.find("a", {"class": ["something"]})    
print(a)

Prints:

<a class="something">Information I want</a>
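When every match is needed rather than just the first, a CSS selector with `:not()` may be the most direct way to express "has this class but not that one". This is a minimal sketch on the same sample HTML; it assumes a Beautiful Soup installation where `select()` is backed by soupsieve (the default in recent releases). The variable names `loose` and `strict` are illustrative.

```python
from bs4 import BeautifulSoup

html = """<a class="something">Information I want</a>
<a class="something somethingelse">Information I don't want</a>"""

soup = BeautifulSoup(html, "html.parser")

# A list-valued class filter matches any tag carrying one of the listed
# classes, so both links come back here.
loose = soup.find_all("a", {"class": ["something"]})

# :not() excludes tags that also carry the unwanted class.
strict = soup.select("a.something:not(.somethingelse)")

print(len(loose))
print([a.get_text() for a in strict])
```

The selector keeps the whole rule in one string, which can be convenient when the class names come from configuration.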

Edit:

To filter on type-icon type-X:

from bs4 import BeautifulSoup


html_doc = """
<a class="type-icon type-1">Information I want 1</a>
<a class="type-icon type-1 type-cell type-abbr">Information I don't want</a>
<a class="type-icon type-2">Information I want 2</a>
<a class="type-icon type-2 type-cell type-abbr">Information I don't want</a>
"""

soup = BeautifulSoup(html_doc, "html.parser")


my_types = ["type-icon", "type-1", "type-2"]


def my_filter(tag):
    if tag.name != "a":
        return False
    c = tag.get("class", [])
    return "type-icon" in c and not set(c).difference(my_types)


a = soup.find_all(my_filter)
print(a)

Prints:

[<a class="type-icon type-1">Information I want 1</a>, <a class="type-icon type-2">Information I want 2</a>]
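If the same kind of filter is needed for several tag names or class sets, the logic of my_filter could be wrapped in a small factory. The helper `make_class_filter` below is hypothetical (it is not part of Beautiful Soup); it is only a sketch of one way to parameterize the required and allowed classes.

```python
from bs4 import BeautifulSoup


def make_class_filter(tag_name, required, allowed):
    """Hypothetical helper: build a filter for find_all(). The tag must
    have every class in `required` and no class outside `allowed`."""
    required, allowed = set(required), set(allowed)

    def _filter(tag):
        if tag.name != tag_name:
            return False
        classes = set(tag.get("class", []))
        # Subset chain: required <= classes and classes <= allowed.
        return required <= classes <= allowed

    return _filter


html_doc = """
<a class="type-icon type-1">Information I want 1</a>
<a class="type-icon type-1 type-cell type-abbr">Information I don't want</a>
<a class="type-icon type-2">Information I want 2</a>
<a class="type-icon type-2 type-cell type-abbr">Information I don't want</a>
"""

soup = BeautifulSoup(html_doc, "html.parser")
links = soup.find_all(
    make_class_filter("a", ["type-icon"], ["type-icon", "type-1", "type-2"])
)
texts = [a.get_text() for a in links]
print(texts)
```

Like my_filter, this also accepts a tag whose only class is type-icon; add "type-1" or "type-2" to the required set if that should be ruled out.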

Or extract the tags you don't want first:

soup = BeautifulSoup(html_doc, "html.parser")

# extract tags I don't want:
for t in soup.select(".type-cell.type-abbr"):
    t.extract()

print(soup.select(".type-icon.type-1, .type-icon.type-2"))

Prints:

[<a class="type-icon type-1">Information I want 1</a>, <a class="type-icon type-2">Information I want 2</a>]

