How to extract an href from a website using the BeautifulSoup package in Python

Question

I have the following example

<h2 class="m0 t-regular">
<a data-js-aid="jobID" data-js-link="" href="/en/qatar/jobs/executive-chef-4276199/" data-job-id="4276199">
Executive Chef  </a>
</h2>

How do I find the "a" tag and read its href?

So far my code returns an empty result:

import time

import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    requests.get("https://www.bayt.com/en/international/jobs/executive-chef-jobs/").content,
    "lxml"
)

follow_links = [
    a["href"] for a in
    soup.find_all("h2", class_="m0 t-regular")
    if "#" not in a["href"]
]
print(follow_links)

result:

[]

How can I get the link?

Answer 1

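You are close. find_all("h2", ...) returns the h2 elements, which have no href of their own. Select the a tag inside each h2 and use ['href'] on it to get the URL.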

Example

import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    requests.get("https://www.bayt.com/en/international/jobs/executive-chef-jobs/").content,
    "lxml"
)
links = []
for a in soup.select("h2.m0.t-regular a"):
    if a['href'] not in links:
        links.append(a['href'])
print(links)
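
To see why the selector matters, here is a minimal offline sketch that parses the snippet from the question instead of the live page. It assumes the built-in "html.parser" is an acceptable stand-in for "lxml" (so it runs without extra dependencies); the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# HTML copied from the question; parsed locally so no network request is needed.
html = """
<h2 class="m0 t-regular">
<a data-js-aid="jobID" data-js-link="" href="/en/qatar/jobs/executive-chef-4276199/" data-job-id="4276199">
Executive Chef  </a>
</h2>
"""

soup = BeautifulSoup(html, "html.parser")

# The <h2> itself carries no href attribute; only the nested <a> does.
h2 = soup.find("h2", class_="m0 t-regular")
print(h2.has_attr("href"))  # False

# Select the <a> inside the <h2>, then read its href.
links = [a["href"] for a in soup.select("h2.m0.t-regular a")]
print(links)
```

The CSS selector "h2.m0.t-regular a" matches any a tag nested inside an h2 that has both classes, so the href lookup happens on the right element.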

Answer 2

To get the href, go from each h2 to the a tag inside it:

follow_links = [p.a["href"] for p in soup.find_all("h2", class_="m0 t-regular") if "#" not in p.a["href"]]

The hrefs are relative, for example /en/qatar/jobs/executive-chef-4276199/. If you want full URLs, prepend "https://www.bayt.com" (no trailing slash, because the href already starts with one):

follow_links = ["https://www.bayt.com" + p.a["href"] for p in soup.find_all("h2", class_="m0 t-regular") if "#" not in p.a["href"]]
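
As an alternative to string concatenation, the standard library's urljoin can build the absolute URL. This is only a sketch using the sample href from the question; the names base and href are illustrative.

```python
from urllib.parse import urljoin

base = "https://www.bayt.com/"
href = "/en/qatar/jobs/executive-chef-4276199/"

# Plain concatenation doubles the slash when the base ends with one
# and the href starts with one.
print(base + href)

# urljoin resolves the href against the base and yields a single slash.
absolute = urljoin(base, href)
print(absolute)
```

urljoin also copes with hrefs that are already absolute, which plain concatenation would not.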

Answer 3

Try this to get your href:

headings = soup.find_all("h2", class_="m0 t-regular")
follow_links = []
for heading in headings:  # each heading is an h2 that wraps the a tag
    if "#" not in heading.a["href"]:
        follow_links.append(heading.a["href"])

Answer 4

You matched

<h2 class="m0 t-regular">
<a data-job-id="4276199" data-js-aid="jobID" data-js-link="" href="/en/qatar/jobs/executive-chef-4276199/">
Executive Chef  </a>
</h2>

by iterating over soup.find_all("h2", class_="m0 t-regular"). So from each h2 you need to get the a tag first, and then read its href attribute.

import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    requests.get("https://www.bayt.com/en/international/jobs/executive-chef-jobs/").content,
    "lxml"
)
# my solution
links = soup.select('h2.m0.t-regular')
for link in links:
    print(link.a['href'])

print(soup.find_all("h2", class_="m0 t-regular")[0])
follow_links = [
    tag_a.a["href"] for tag_a in
    soup.find_all("h2", class_="m0 t-regular")
    if "#" not in tag_a.a["href"]
]
print(follow_links)
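
One caveat with tag.a["href"]: it raises an error if a heading has no a tag, or if the a tag has no href. Below is a hedged sketch of a more defensive loop. The markup is hypothetical (made up to show the edge cases, not taken from the site), and "html.parser" is used so the example runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second heading has no link,
# and the third has an <a> without an href attribute.
html = """
<h2 class="m0 t-regular"><a href="/en/qatar/jobs/executive-chef-4276199/">Executive Chef</a></h2>
<h2 class="m0 t-regular">No link here</h2>
<h2 class="m0 t-regular"><a>Anchor without href</a></h2>
"""
soup = BeautifulSoup(html, "html.parser")

follow_links = []
for h2 in soup.find_all("h2", class_="m0 t-regular"):
    a = h2.a  # None when the heading contains no <a>
    # .get returns None for a missing attribute instead of raising KeyError
    href = a.get("href") if a else None
    if href and "#" not in href:
        follow_links.append(href)

print(follow_links)
```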

Answer 5

Your code extracts the h2 tags, which have no href. From each h2, move to the next a tag with find_next('a') and read its href from there:

import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    requests.get("https://www.bayt.com/en/international/jobs/executive-chef-jobs/").content,
    "lxml"
)

follow_links = [a.find_next('a')['href'] for a in soup.find_all("h2", class_="m0 t-regular")]
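
Note that find_next searches forward through the rest of the document, not only inside the h2. The following sketch, on hypothetical markup, shows how that differs from h2.a when a heading has no link of its own ("html.parser" is used so it runs without lxml).

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the heading has no link, but a later paragraph does.
html = """
<h2 class="m0 t-regular">Heading without a link</h2>
<p><a href="/somewhere-else/">Unrelated link</a></p>
"""
soup = BeautifulSoup(html, "html.parser")
h2 = soup.find("h2", class_="m0 t-regular")

# h2.a only looks inside the heading.
print(h2.a)  # None

# find_next keeps going past the end of the heading.
print(h2.find_next("a")["href"])  # /somewhere-else/
```

On the page from the question every matching h2 wraps its own a tag, so both approaches should give the same links there.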

5 answers

solution1
0 ACCEPTED 2021-01-26 09:21:38

solution2
0 2021-01-26 09:22:32

solution3
0 2021-01-26 09:24:54

solution4
0 2021-01-26 09:40:12

solution5
0 2021-01-26 10:25:32
