Trying to find all <a> elements without a *specific* class
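This is my first attempt at web scraping, and I'm using BeautifulSoup to collect information from a website. I'm trying to get all elements that have one class but not another. For example: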
from bs4 import BeautifulSoup
html = """
<a class="something">Information I want</a>
<a class="something somethingelse">Information I don't want</a>
"""
soup = BeautifulSoup(html, "html.parser")
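In this example, I want to get all elements with the something class. However, when I search for all elements that contain that class, I also get the elements that contain the somethingelse class, and I don't want those.
The code I'm using to get them is: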
results = soup.find_all("a", {"class": "something"})
Any help is appreciated. Thanks.
from bs4 import BeautifulSoup
text = '''<a class="something">Information I want</a>
<a class="something somethingelse">Information I don't want</a>'''
soup = BeautifulSoup(text, 'html.parser')
r1 = soup.find_all("a", {"class": "something"})
r2 = soup.find_all("a", {"class": "somethingelse"})
for item in r2:
    if item in r1:
        r1.remove(item)
print(r1)
Output
[<a class="something">Information I want</a>]
To extract the text inside the tags, just add the following lines:
for item in r1:
    print(item.text)
Output
Information I want
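The same "has one class but not the other" condition can also be written as a single CSS selector. This is only a sketch, assuming a reasonably recent bs4 (4.7 or later), where select() is backed by the Soup Sieve package and understands :not():

```python
from bs4 import BeautifulSoup

html = """
<a class="something">Information I want</a>
<a class="something somethingelse">Information I don't want</a>
"""
soup = BeautifulSoup(html, "html.parser")

# a.something          -> <a> tags that carry the "something" class
# :not(.somethingelse) -> ...and do not also carry "somethingelse"
results = soup.select("a.something:not(.somethingelse)")

texts = [tag.get_text() for tag in results]
print(texts)
```

Unlike the two-pass approach above, this does not build a second result list or modify the first one.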
For this task, you can find elements with a lambda function, for example:
from bs4 import BeautifulSoup
html_doc = """<a class="something">Information I want</a>
<a class="something somethingelse">Information I don't want</a>
"""
soup = BeautifulSoup(html_doc, "html.parser")
a = soup.find(
    lambda tag: tag.name == "a" and tag.get("class", []) == ["something"]
)
print(a)
Prints:
<a class="something">Information I want</a>
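Or: specify the "class" as a list. Note that a list matches any tag that has at least one of the listed classes, so this prints the wanted element only because find() stops at the first match; find_all() with the same arguments would also return the second <a>: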
a = soup.find("a", {"class": ["something"]})
print(a)
Prints:
<a class="something">Information I want</a>
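To see how the two approaches differ once you ask for all matches rather than the first one, here is a small comparison sketch. It assumes the usual bs4 behaviour, where a list value for "class" means "any of these classes"; the variable names by_list and exact are only for illustration:

```python
from bs4 import BeautifulSoup

html_doc = """<a class="something">Information I want</a>
<a class="something somethingelse">Information I don't want</a>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# A list value for "class" matches tags that have ANY of the listed classes,
# so the tag with both classes is matched as well.
by_list = soup.find_all("a", {"class": ["something"]})

# Comparing the whole class list keeps only tags whose classes are
# exactly ["something"].
exact = soup.find_all(
    lambda tag: tag.name == "a" and tag.get("class", []) == ["something"]
)

print(len(by_list), len(exact))
```

So for the original question (all elements, not just the first), the lambda comparison is the variant that actually excludes the unwanted tag.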
Edit:
To filter for type-icon type-X:
from bs4 import BeautifulSoup
html_doc = """
<a class="type-icon type-1">Information I want 1</a>
<a class="type-icon type-1 type-cell type-abbr">Information I don't want</a>
<a class="type-icon type-2">Information I want 2</a>
<a class="type-icon type-2 type-cell type-abbr">Information I don't want</a>
"""
soup = BeautifulSoup(html_doc, "html.parser")
my_types = ["type-icon", "type-1", "type-2"]
def my_filter(tag):
    if tag.name != "a":
        return False
    c = tag.get("class", [])
    # keep tags that have type-icon and no class outside my_types
    return "type-icon" in c and not set(c).difference(my_types)
a = soup.find_all(my_filter)
print(a)
Prints:
[<a class="type-icon type-1">Information I want 1</a>, <a class="type-icon type-2">Information I want 2</a>]
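my_filter works as an allow-list: every class on the tag must appear in my_types. If it is easier to name the classes you do not want, the same idea can be turned into a block-list. The has_classes helper below is hypothetical (it is not part of bs4), just a sketch of that variant:

```python
from bs4 import BeautifulSoup

html_doc = """
<a class="type-icon type-1">Information I want 1</a>
<a class="type-icon type-1 type-cell type-abbr">Information I don't want</a>
<a class="type-icon type-2">Information I want 2</a>
<a class="type-icon type-2 type-cell type-abbr">Information I don't want</a>
"""
soup = BeautifulSoup(html_doc, "html.parser")


def has_classes(name, required, forbidden):
    """Build a find_all() filter (hypothetical helper, not a bs4 API)."""
    required, forbidden = set(required), set(forbidden)

    def _filter(tag):
        classes = set(tag.get("class", []))
        # every required class is present and no forbidden class is
        return tag.name == name and required <= classes and not (classes & forbidden)

    return _filter


wanted = soup.find_all(has_classes("a", ["type-icon"], ["type-cell", "type-abbr"]))
texts = [t.get_text() for t in wanted]
print(texts)
```

With a block-list you do not have to enumerate type-1, type-2, and so on, but any new unwanted class has to be added to forbidden by hand.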
Or extract the tags you don't want first:
soup = BeautifulSoup(html_doc, "html.parser")
# extract tags I don't want:
for t in soup.select(".type-cell.type-abbr"):
    t.extract()
print(soup.select(".type-icon.type-1, .type-icon.type-2"))
Prints:
[<a class="type-icon type-1">Information I want 1</a>, <a class="type-icon type-2">Information I want 2</a>]
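One thing to keep in mind with this last approach: extract() detaches the tags from the tree, so later searches on the same soup object no longer see them. A short sketch that makes the side effect visible (the before and after names are only for illustration):

```python
from bs4 import BeautifulSoup

html_doc = """
<a class="type-icon type-1">Information I want 1</a>
<a class="type-icon type-1 type-cell type-abbr">Information I don't want</a>
<a class="type-icon type-2">Information I want 2</a>
<a class="type-icon type-2 type-cell type-abbr">Information I don't want</a>
"""
soup = BeautifulSoup(html_doc, "html.parser")
before = len(soup.find_all("a"))

for t in soup.select(".type-cell.type-abbr"):
    t.extract()  # removes the tag from the parsed tree

after = len(soup.find_all("a"))
print(before, after)
```

If you still need the full document afterwards, parse html_doc again into a fresh soup, or use one of the filtering approaches that leave the tree untouched.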