How to extract href content from a website using the BeautifulSoup package in Python
I have the following sample:
<h2 class="m0 t-regular">
<a data-js-aid="jobID" data-js-link="" href="/en/qatar/jobs/executive-chef-4276199/" data-job-id="4276199">
Executive Chef </a>
</h2>
How can I find the "a" tag?
So far my code returns an empty result:
import time
import requests
from bs4 import BeautifulSoup
soup = BeautifulSoup(
    requests.get("https://www.bayt.com/en/international/jobs/executive-chef-jobs/").content,
    "lxml"
)
follow_links = [
    a["href"] for a in
    soup.find_all("h2", class_="m0 t-regular")
    if "#" not in a["href"]
]
print(follow_links)
[]
The question is: how do I get the links returned?
You are close. Use ['href'] on the a tag to get the URL.
Example:
import requests
from bs4 import BeautifulSoup
soup = BeautifulSoup(
    requests.get("https://www.bayt.com/en/international/jobs/executive-chef-jobs/").content,
    "lxml"
)
links = []
# select every <a> nested inside <h2 class="m0 t-regular">
for a in soup.select("h2.m0.t-regular a"):
    # skip hrefs that were already collected
    if a['href'] not in links:
        links.append(a['href'])
print(links)
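The selector can also be tried offline against the sample markup from the question, which makes it easy to check the logic without hitting the site. What follows is only a minimal sketch: the html string is a hand-made stand-in with one entry duplicated on purpose, the live page is assumed (not verified) to share that structure, and "html.parser" is used in place of "lxml" only to avoid the extra dependency.

```python
from bs4 import BeautifulSoup

# Hand-made stand-in for the real page (assumption: the live markup looks like this).
html = """
<h2 class="m0 t-regular">
<a data-js-aid="jobID" data-js-link="" href="/en/qatar/jobs/executive-chef-4276199/" data-job-id="4276199">
Executive Chef </a>
</h2>
<h2 class="m0 t-regular">
<a data-js-aid="jobID" data-js-link="" href="/en/qatar/jobs/executive-chef-4276199/" data-job-id="4276199">
Executive Chef </a>
</h2>
"""

soup = BeautifulSoup(html, "html.parser")

# Select every <a> nested in an <h2> that carries both classes.
hrefs = [a["href"] for a in soup.select("h2.m0.t-regular a")]

# dict.fromkeys keeps the first occurrence of each href, in order.
links = list(dict.fromkeys(hrefs))
print(links)
```

Using dict.fromkeys is just one way to drop duplicates while keeping order; the explicit "not in links" check does the same job and is fine for short lists.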
To get the href links, you need the following code:
follow_links = [p.a["href"] for p in soup.find_all("h2", class_="m0 t-regular") if "#" not in p.a["href"]]
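If you do not want a relative href such as href="/en/qatar/jobs/executive-chef-4276199/", prepend "https://www.bayt.com" (without a trailing slash, because the href already starts with one):
follow_links = ["https://www.bayt.com"+p.a["href"] for p in soup.find_all("h2", class_="m0 t-regular") if "#" not in p.a["href"]]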
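String concatenation depends on whether the base ends with a slash and whether the href starts with one. As an alternative sketch, urllib.parse.urljoin from the standard library resolves a relative href against a base URL and leaves absolute URLs alone. The names BASE_URL and to_absolute are illustrative, not part of the original answer.

```python
from urllib.parse import urljoin

# Assumed base for the site in the question.
BASE_URL = "https://www.bayt.com/"


def to_absolute(href: str) -> str:
    """Resolve href against BASE_URL; already-absolute URLs pass through unchanged."""
    return urljoin(BASE_URL, href)


relative = "/en/qatar/jobs/executive-chef-4276199/"
print(to_absolute(relative))
```

In the list comprehension, "https://www.bayt.com"+p.a["href"] could then be swapped for to_absolute(p.a["href"]).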
Try this to get your href:
headings = soup.find_all("h2", class_="m0 t-regular")
follow_links = []
for heading in headings:  # then you can process each one with something like:
    if "#" not in heading.a["href"]:
        follow_links.append(heading.a["href"])
Each iteration gives you this element:
<h2 class="m0 t-regular">
<a data-job-id="4276199" data-js-aid="jobID" data-js-link="" href="/en/qatar/jobs/executive-chef-4276199/">
Executive Chef </a>
</h2>
That is what soup.find_all("h2", class_="m0 t-regular") yields on each iteration. So you need to select the a tag inside it and then read its 'href' attribute.
import requests
from bs4 import BeautifulSoup
soup = BeautifulSoup(
    requests.get("https://www.bayt.com/en/international/jobs/executive-chef-jobs/").content,
    "lxml"
)
# my solution
links = soup.select('h2.m0.t-regular')
for link in links:
    print(link.a['href'])
# inspect the first matching h2 element
print(soup.find_all("h2", class_="m0 t-regular")[0])
follow_links = [
    tag_a.a["href"] for tag_a in
    soup.find_all("h2", class_="m0 t-regular")
    if "#" not in tag_a.a["href"]
]
print(follow_links)
Your code extracts the h2 tags. You should take the next tag after each h2, which is the a tag. Only from there can you get the href.
import time
import requests
from bs4 import BeautifulSoup
soup = BeautifulSoup(
    requests.get("https://www.bayt.com/en/international/jobs/executive-chef-jobs/").content,
    "lxml"
)
follow_links = [h2.find_next('a')['href'] for h2 in soup.find_all("h2", class_="m0 t-regular")]
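One thing to keep in mind with find_next: it walks forward through the whole document, so if an h2 happens to contain no link it returns the next a tag that appears anywhere after it. The sketch below contrasts that with a search scoped to the h2 itself. The markup is invented for illustration (the first heading and the "/unrelated/" link are not from the real page), and "html.parser" is used only to avoid the lxml dependency.

```python
from bs4 import BeautifulSoup

# Invented markup: the first heading deliberately has no link of its own.
html = """
<h2 class="m0 t-regular">No link here</h2>
<p><a href="/unrelated/">Unrelated</a></p>
<h2 class="m0 t-regular"><a href="/en/qatar/jobs/executive-chef-4276199/">Executive Chef</a></h2>
"""

soup = BeautifulSoup(html, "html.parser")
headings = soup.find_all("h2", class_="m0 t-regular")

# find_next looks past the end of the h2, so it picks up the unrelated link.
via_find_next = [h2.find_next("a")["href"] for h2 in headings]

# find searches only inside the h2; headings without a link are skipped.
scoped = []
for h2 in headings:
    a = h2.find("a", href=True)
    if a is not None:
        scoped.append(a["href"])

print(via_find_next)
print(scoped)
```

When every heading is known to wrap a link, both approaches give the same list; the scoped version is simply the more defensive choice.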