Python + BeautifulSoup: How can I get full link from href attribute?

Question

I'm putting together a web crawler for practice & learning and found some issues. My original thought process was...

On a given page, find all href attributes. If the href value is a valid link, go to this new link and continue.
If the href value is a path (e.g. "/patients/patient-portal" or "/services/financial-assistance"), append it to the end of the current URL and continue again.

A problem arose that I hadn't anticipated: some of the paths reference other resources on the website. (Picture included.) The current URL is "patients-visitors/advance-directives/" and the resource "services/family-medicine" actually refers to "columbiabasinhospital.org/services/family-medicine". The way I have it set up would produce an incorrect URL ("patients-visitors/advance-directives/services/family-medicine"). Mousing over the resource shows the full link. I'm wondering if there's a way to retrieve that using BeautifulSoup? Thank you!

Answer 1

Use urllib.parse.urljoin to build the correct URL from a base URL and another, potentially relative, URL or path:

from urllib.parse import urljoin

new_url = urljoin(current_url, href)

For example

urljoin('http://localhost/foo/bar/', '/baz/')
# Outputs 'http://localhost/baz/'
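To show how this could fit into a crawler loop, here is a minimal sketch. The helper name resolve_links and the sample href values are made up for illustration, and the https scheme on the hospital URL is an assumption (the question does not state it). In a real crawler the href strings would typically come from something like [a["href"] for a in soup.find_all("a", href=True)].

```python
from urllib.parse import urljoin, urlparse


def resolve_links(current_url, hrefs):
    """Turn the href values found on current_url into absolute http(s) URLs.

    Hypothetical helper: urljoin does the resolving, the rest is filtering.
    """
    resolved = []
    for href in hrefs:
        # urljoin leaves absolute URLs alone and resolves relative ones
        # against the page they were found on.
        absolute = urljoin(current_url, href.strip())
        # Skip non-web schemes such as mailto: and avoid duplicates.
        if urlparse(absolute).scheme in ("http", "https") and absolute not in resolved:
            resolved.append(absolute)
    return resolved


# Assumed sample data: the scheme and the extra hrefs are illustrative only.
page = "https://columbiabasinhospital.org/patients-visitors/advance-directives/"
hrefs = [
    "/services/family-medicine",
    "https://example.com/about",
    "mailto:info@example.com",
]
links = resolve_links(page, hrefs)
print(links)
```

With the sample data above, the root-relative path is resolved against the site root rather than appended to the current page, which is the behaviour the question was missing.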

Answer 2

You can import urljoin from urllib.parse, but you can also implement the logic yourself.

Suppose that the current URL is: http://example.com/path1/path2

When the value of the href attribute is something like /x, it is relative to the site root, so you must append it to the root, i.e. http://example.com/x

However, when the value of the href attribute is something like ./x or x, it is relative to the directory of the current page, so it replaces the last path segment (path2), i.e. http://example.com/path1/x
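As a quick check of these rules, the sketch below runs urljoin over the cases described above. The variable names are illustrative; the last two cases (a base URL with a trailing slash, and an href that is already absolute) are additions not mentioned in the answer.

```python
from urllib.parse import urljoin

base = "http://example.com/path1/path2"

# Root-relative: resolved against the site root.
root_relative = urljoin(base, "/x")

# Document-relative: the last path segment (path2) is replaced.
dot_relative = urljoin(base, "./x")
bare_relative = urljoin(base, "x")

# With a trailing slash, path2 counts as a directory and is kept.
with_slash = urljoin(base + "/", "x")

# An href that is already absolute is returned unchanged.
already_absolute = urljoin(base, "https://example.org/y")

for result in (root_relative, dot_relative, bare_relative, with_slash, already_absolute):
    print(result)
```

Note the trailing-slash case: whether the current URL ends in "/" changes how relative hrefs resolve, which is one reason to prefer urljoin over hand-written string concatenation.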
