I'm putting together a web crawler for practice & learning and found some issues. My original thought process was...
I then ran into a problem I hadn't anticipated: some of the paths reference other resources on the website (picture included). The current URL is "patients-visitors/advance-directives/", and the resource "services/family-medicine" actually refers to "columbiabasinhospital.org/services/family-medicine". The way I have it set up would produce an incorrect URL (patients-visitors/advance-directives/services/family-medicine). Mousing over the resource shows the full link. Is there a way to retrieve that using BeautifulSoup? Thank you!
Use urllib.parse.urljoin to return the correct URL from a base URL and another, potentially relative, URL or path:
from urllib.parse import urljoin
new_url = urljoin(current_url, href)
For example:
urljoin('http://localhost/foo/bar/', '/baz/')
# Outputs 'http://localhost/baz/'
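Applied to the URLs from the question, a minimal sketch might look like the following. The https scheme and the exact href spellings are assumptions made for illustration; the question does not show the raw attribute values. It shows why a leading slash matters: an href that starts with "/" resolves against the site root, one without it resolves against the current directory, and an already absolute URL passes through unchanged.

```python
from urllib.parse import urljoin

# Assumed for illustration: the scheme and the exact href values.
current_url = "https://columbiabasinhospital.org/patients-visitors/advance-directives/"

# Root-relative href (leading slash): resolved against the site root.
root_relative = urljoin(current_url, "/services/family-medicine")

# Page-relative href (no leading slash): resolved against the current directory.
page_relative = urljoin(current_url, "services/family-medicine")

# An href that is already absolute is returned unchanged.
absolute = urljoin(current_url, "https://example.com/other")

print(root_relative)
print(page_relative)
print(absolute)
```

Because urljoin handles all three cases, the crawler can pass every extracted href through it without first checking which kind it is.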
You can use urljoin from urllib.parse, but you can also write the logic yourself.
Suppose the current URL is http://example.com/path1/path2.
When the value of the href attribute is something like /x, you must resolve it against the site root, i.e. http://example.com/x.
However, when the value of the href attribute is something like ./x or x, you need to resolve it against the directory of the current URL (the last segment, path2, is dropped), i.e. http://example.com/path1/x.
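The two rules above could be sketched by hand roughly as follows. The name resolve_href is hypothetical, and this is a simplified illustration rather than a full replacement for urljoin: it ignores "../" segments, protocol-relative hrefs such as "//host/x", and query strings or fragments on the current URL.

```python
from urllib.parse import urlsplit

def resolve_href(current_url, href):
    """Hypothetical helper: resolve href against current_url by hand."""
    parts = urlsplit(current_url)
    root = f"{parts.scheme}://{parts.netloc}"

    # An href that carries its own scheme is already absolute.
    if urlsplit(href).scheme:
        return href

    # Leading slash: resolve against the site root.
    if href.startswith("/"):
        return root + href

    # "./x" means the same as "x" here.
    if href.startswith("./"):
        href = href[2:]

    # Otherwise drop the last path segment and append the href.
    directory = parts.path.rsplit("/", 1)[0]
    return f"{root}{directory}/{href}"

current = "http://example.com/path1/path2"
print(resolve_href(current, "/x"))
print(resolve_href(current, "./x"))
print(resolve_href(current, "x"))
```

For real crawling, urljoin is likely the safer choice, since it covers the cases this sketch leaves out.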