
Python + BeautifulSoup: How can I get full link from href attribute?

I'm putting together a web crawler for practice and learning, and I ran into an issue. My original thought process was:

  1. On a given page, find all href attributes. If the href value is a valid link, go to this new link and continue.
  2. If the href value is a path (e.g., "/patients/patient-portal" or "/services/financial-assistance"), append it to the end of the URL I'm currently on and continue again.

Then I hit a problem I had not anticipated: some of the paths refer to other resources on the website (picture included). The current URL is "patients-visitors/advance-directives/", and the resource "services/family-medicine" actually refers to "columbiabasinhospital.org/services/family-medicine". The way I have it set up would produce an incorrect URL ("patients-visitors/advance-directives/services/family-medicine"). Mousing over the resource shows the full link. Is there a way to retrieve that full link using BeautifulSoup? Thank you!


Use urllib.parse.urljoin to build the correct URL from a base URL and another, potentially relative, URL or path:

from urllib.parse import urljoin

new_url = urljoin(current_url, href)

For example:

urljoin('http://localhost/foo/bar/', '/baz/')
# Returns 'http://localhost/baz/'
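
Applied to the situation in the question, a minimal sketch might look like the following. The "https://" scheme, the helper name absolute_links, and the sample href values are assumptions for illustration; the question gives only the host and the paths.

```python
from urllib.parse import urljoin


def absolute_links(current_url, hrefs):
    """Resolve each href against the URL of the page it was found on."""
    return [urljoin(current_url, href) for href in hrefs]


# Assumed page URL: the question gives only host and path, so "https://" is a guess.
current_url = "https://columbiabasinhospital.org/patients-visitors/advance-directives/"

# Hypothetical href values of the kinds described in the question.
hrefs = [
    "/services/family-medicine",      # root-relative: resolved from the site root
    "services/family-medicine",       # document-relative: resolved from the current directory
    "https://example.com/elsewhere",  # already absolute: returned unchanged
]

for href, url in zip(hrefs, absolute_links(current_url, hrefs)):
    print(f"{href!r} -> {url}")
```

With BeautifulSoup, the href values would typically be collected with something like [a["href"] for a in soup.find_all("a", href=True)] and then passed through urljoin together with the URL of the page that was fetched. Note that only the href with a leading slash resolves to the site root; the one without it resolves below the current directory, which is the same result the manual appending in the question produced.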

You can import urljoin from urllib.parse, but you can also write the logic yourself.

Suppose the current URL is http://example.com/path1/path2

When the value of the href attribute is something like /x, it is relative to the site root, so the result is http://example.com/x

However, when the value of the href attribute is something like ./x or x, it is relative to the directory of the current URL: the last path segment (path2) is dropped and x takes its place, giving http://example.com/path1/x
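
The rules above could be sketched by hand roughly as follows. The function name resolve_href is hypothetical, and this is a simplified illustration rather than a replacement for urljoin: it does not handle "..", protocol-relative hrefs ("//host/path"), fragments, or query strings, all of which urljoin does handle.

```python
from urllib.parse import urlsplit


def resolve_href(current_url, href):
    """Simplified resolver covering only the three cases described above."""
    # Case 0: the href already carries a scheme, so it is a full URL.
    if urlsplit(href).scheme:
        return href

    parts = urlsplit(current_url)
    root = f"{parts.scheme}://{parts.netloc}"

    # Case 1: a leading slash means "relative to the site root".
    if href.startswith("/"):
        return root + href

    # Case 2: "./x" and "x" are relative to the current directory,
    # i.e. the current path with its last segment dropped.
    if href.startswith("./"):
        href = href[2:]
    directory = parts.path.rsplit("/", 1)[0]
    return f"{root}{directory}/{href}"


current = "http://example.com/path1/path2"
print(resolve_href(current, "/x"))   # http://example.com/x
print(resolve_href(current, "./x"))  # http://example.com/path1/x
print(resolve_href(current, "x"))    # http://example.com/path1/x
```

For a real crawler, urljoin is probably the safer choice, since it covers the edge cases this sketch skips.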
