
How to extract href content from a website using the BeautifulSoup package in Python

I have the following sample:

<h2 class="m0 t-regular">
<a data-js-aid="jobID" data-js-link="" href="/en/qatar/jobs/executive-chef-4276199/" data-job-id="4276199">
Executive Chef  </a>
</h2>

How can I find the 'a' tag?

So far my code returns an empty result:

import time

import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    requests.get("https://www.bayt.com/en/international/jobs/executive-chef-jobs/").content,
    "lxml"
)

follow_links = [
    a["href"] for a in
    soup.find_all("h2", class_="m0 t-regular")
    if "#" not in a["href"]
]
print(follow_links)

Result:

[]

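The question is: how do I return the links?

You are close. Select the a tags inside the h2 elements and use ['href'] to get the URL.

Example: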

import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    requests.get("https://www.bayt.com/en/international/jobs/executive-chef-jobs/").content,
    "lxml"
)
links = []
for a in soup.select("h2.m0.t-regular a"):
    if a['href'] not in links:
        links.append(a['href'])
print(links)
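
To see why the lookup has to happen on the a tag rather than on the h2, here is a minimal offline sketch. It parses the sample markup from the question instead of fetching the live page, and it uses the built-in "html.parser" so that lxml is not required; both choices are assumptions made to keep the example self-contained.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question; no network request is made.
SAMPLE_HTML = """
<h2 class="m0 t-regular">
<a data-js-aid="jobID" data-js-link="" href="/en/qatar/jobs/executive-chef-4276199/" data-job-id="4276199">
Executive Chef  </a>
</h2>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# The h2 element itself carries no href attribute.
h2 = soup.find("h2", class_="m0 t-regular")
h2_has_href = h2.has_attr("href")

# The nested a element is where the href lives.
links = [a["href"] for a in soup.select("h2.m0.t-regular a") if a.has_attr("href")]

print(h2_has_href)
print(links)
```

If the same selector returns nothing against the live page, the downloaded HTML probably differs from the sample, which is worth checking separately.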

To get the href links, you need the following code:

follow_links = [p.a["href"] for p in soup.find_all("h2", class_="m0 t-regular") if "#" not in p.a["href"]]

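If you do not want a relative href such as "/en/qatar/jobs/executive-chef-4276199/", prepend "https://www.bayt.com" (without a trailing slash, because the href already starts with one):

follow_links = ["https://www.bayt.com" + p.a["href"] for p in soup.find_all("h2", class_="m0 t-regular") if "#" not in p.a["href"]]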
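
As an alternative to string concatenation, the standard library's urllib.parse.urljoin can resolve each relative href against the page URL, which avoids doubled or missing slashes. This is only a sketch: BASE_URL and relative_links are illustrative names, and the single href is the one from the question's sample.

```python
from urllib.parse import urljoin

# Page URL from the question, used as the base for resolving relative links.
BASE_URL = "https://www.bayt.com/en/international/jobs/executive-chef-jobs/"

# Stand-in for the hrefs extracted with BeautifulSoup.
relative_links = ["/en/qatar/jobs/executive-chef-4276199/"]

# urljoin keeps the scheme and host of the base and replaces the path.
absolute_links = [urljoin(BASE_URL, href) for href in relative_links]

print(absolute_links)
```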

Try this to get your href:

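headings = soup.find_all("h2", class_="m0 t-regular")
follow_links = []
for heading in headings:  # then process each heading with something like:
    if "#" not in heading.a["href"]:
        # append() modifies the list; "follow_links + [...]" alone would discard the result
        follow_links.append(heading.a["href"])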

Each iteration over soup.find_all("h2", class_="m0 t-regular") gives you an h2 element like this:

<h2 class="m0 t-regular">
<a data-job-id="4276199" data-js-aid="jobID" data-js-link="" href="/en/qatar/jobs/executive-chef-4276199/">
Executive Chef  </a>
</h2>

So you need to take the a tag inside it and then read its 'href' attribute.

import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    requests.get("https://www.bayt.com/en/international/jobs/executive-chef-jobs/").content,"lxml")
# my solution
links = soup.select('h2.m0.t-regular')
for link in links:
    print(link.a['href'])

print(soup.find_all("h2", class_="m0 t-regular")[0])
follow_links = [
    tag_a.a["href"] for tag_a in
    soup.find_all("h2", class_="m0 t-regular")
    if "#" not in tag_a.a["href"]
]
print(follow_links)
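
One caveat with h2.a["href"]: if a matching h2 has no a child, h2.a is None and the subscript raises a TypeError. The sketch below shows a defensive variant of the same loop. The markup is made up for illustration (only the first heading mirrors the question's sample), and "html.parser" is used so the example runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one real link, one heading without a link, one "#" placeholder.
SAMPLE_HTML = """
<h2 class="m0 t-regular"><a href="/en/qatar/jobs/executive-chef-4276199/">Executive Chef</a></h2>
<h2 class="m0 t-regular">Heading without a link</h2>
<h2 class="m0 t-regular"><a href="#">Placeholder</a></h2>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

follow_links = []
for h2 in soup.find_all("h2", class_="m0 t-regular"):
    a = h2.a
    if a is None:
        # No a tag inside this heading, so there is nothing to extract.
        continue
    href = a.get("href")  # returns None instead of raising if href is missing
    if href and "#" not in href:
        follow_links.append(href)

print(follow_links)
```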

Your code extracts the h2 tags. You should get the next tag after each h2, which is the a tag; only from there can you read the href attribute.

import time

import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    requests.get("https://www.bayt.com/en/international/jobs/executive-chef-jobs/").content,
    "lxml"
)

follow_links = [a.find_next('a')['href'] for a in soup.find_all("h2", class_="m0 t-regular")]
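
Note that find_next('a') searches everything that follows the h2 in document order, not just its children, so it can pick up a link that sits outside the heading. The sketch below contrasts it with find('a'), which only looks at descendants. The markup is invented for illustration and is not taken from the site.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the heading has no link, but a later paragraph does.
SAMPLE_HTML = """
<h2 class="m0 t-regular">No link here</h2>
<p><a href="/unrelated/">Unrelated link</a></p>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
h2 = soup.find("h2", class_="m0 t-regular")

# find_next looks at everything after the h2 in the document.
next_link = h2.find_next("a")

# find only looks inside the h2 itself.
inner_link = h2.find("a")

print(next_link["href"])
print(inner_link)
```

For the page in the question the two approaches should agree, since each job heading wraps its own a tag; the difference only matters when some headings lack a link.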


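Notice: technical posts on this site follow the CC BY-SA 4.0 license. If you republish this content, please credit this site's URL or the original post's address. For any questions, contact: yoyou2525@163.com.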

 