
Scraping using Python Beautifulsoup getting the url of href that is a link

Using Python/BeautifulSoup to scrape some documentation URLs, I am trying to get the actual link for an href. The href is not a full HTML link but an "embedded" one: if I hover over it in a browser, it shows me the actual URL.

The "view source" of the page has this: <li class="toctree-l2"><a class="reference internal" href="accessanalyzer.html">AccessAnalyzer</a></li>

Now the following code does work and does get me the href string:

for i in soup.find_all('a', attrs={'class': 'reference internal'}):
    if "AccessAnalyzer" in i:
        print(i)
        link = i['href']
        print(link)

(output)
<a class="reference internal" href="accessanalyzer.html">AccessAnalyzer</a>
accessanalyzer.html

What I am trying to get is the actual URL of the accessanalyzer.html which is:

https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/accessanalyzer.html

When I hover over the link in a browser, or click on it, it takes me to that URL.

How can I get the URL? Also, what is the name for an href that holds an embedded link like this rather than the full URL text? (So I can research it further.)

You would have to do some extra processing after retrieving the HREF value.

What you would need to do is get the base URL path of the source page, and append the HREF value.

Let's say the source page is "https://example.com/stuff/source.html", and that page contains a link with HREF "foo.html". You would need to get the base URL path of the source page (which is "https://example.com/stuff/") and append the HREF value to get "https://example.com/stuff/foo.html".

You can use the dirname function to help you:

>>> import os
>>> dir = os.path.dirname('https://example.com/stuff/source.html')
>>> dir
'https://example.com/stuff'

and then join the 2 parts together:

>>> os.path.join(dir, "foo.html")
'https://example.com/stuff/foo.html'
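
An href like "accessanalyzer.html" is a relative URL, and the browser resolves it against the URL of the page it appears on. As an alternative to the os.path approach (which depends on the operating system's path separator), here is a minimal sketch using urljoin from the standard library. The PAGE_URL value and the absolute_url helper are assumptions made for illustration; the question does not state the URL of the page being scraped.

```python
from urllib.parse import urljoin

# Assumption: the URL of the page being scraped. The question does not give
# it, so this value is hypothetical; use the URL you actually requested.
PAGE_URL = "https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/index.html"


def absolute_url(page_url, href):
    """Resolve an href found on page_url into an absolute URL.

    Hypothetical helper: a thin wrapper around urllib.parse.urljoin.
    """
    return urljoin(page_url, href)


# The relative href from the question, resolved against the assumed page URL.
print(absolute_url(PAGE_URL, "accessanalyzer.html"))

# The example.com case described above.
print(absolute_url("https://example.com/stuff/source.html", "foo.html"))

# urljoin also handles parent-directory and root-relative hrefs.
print(absolute_url("https://example.com/stuff/source.html", "../other/bar.html"))
print(absolute_url("https://example.com/stuff/source.html", "/top.html"))
```

In the loop from the question, the value of i['href'] could be passed as the href argument. An href that is already absolute comes back from urljoin unchanged.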

Similar to what's described here. I believe you're actually going to need some kind of webdriver automator (Selenium, etc.) to simulate the hover-over and get the data.
