
Scraping with Python BeautifulSoup: getting the full URL of an href link

I am using Python/BeautifulSoup to scrape some documentation URLs, and I am trying to get the actual link for an href. The href is not a full HTML link but an "embedded" one: if I hover over it in a browser, it shows me the actual URL.

The "view source" of the page has this: <li class="toctree-l2"><a class="reference internal" href="accessanalyzer.html">AccessAnalyzer</a></li>

The following code does work and does get me the href string:

for i in soup.find_all('a', attrs={'class': 'reference internal'}):
    if "AccessAnalyzer" in i:
        print(i)
        link = i['href']
        print(link)

(output)
<a class="reference internal" href="accessanalyzer.html">AccessAnalyzer</a>
accessanalyzer.html

What I am trying to get is the actual URL of accessanalyzer.html, which is:

https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/accessanalyzer.html

When I hover over the href, the browser shows that URL, and clicking it takes me there.

How can I get the URL? Also, what is this concept called, where an href holds an embedded link rather than the actual URL text? (So I can research it more.)

You would have to do some extra processing after retrieving the HREF value.

What you need to do is get the base URL path of the source page and append the HREF value.

Let's say the source page is "https://example.com/stuff/source.html", and that page contains a link with HREF "foo.html". You would need to get the base URL path of the source page (which is "https://example.com/stuff/") and append the HREF value to get "https://example.com/stuff/foo.html".
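The same resolution can also be done with the standard library's urllib.parse.urljoin, which combines a page URL with a relative href (an href like "accessanalyzer.html" is usually called a relative URL). The sketch below is a minimal illustration, not the asker's code; the page_url value is an assumption, since the question does not say which page was scraped.

```python
from urllib.parse import urljoin

# Assumption: the scraped page is the services index page. The question
# does not state the page URL, so substitute the one you actually fetched.
page_url = "https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/index.html"

# The relative href, as returned by i['href'] in the question's loop.
href = "accessanalyzer.html"

# urljoin replaces the last path segment of page_url with the relative href.
full_url = urljoin(page_url, href)
print(full_url)
# https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/accessanalyzer.html
```

Inside the question's loop, this would amount to calling urljoin(page_url, i['href']) for each matching tag.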

You can use the os.path.dirname function to help you:

>>> import os
>>> dir = os.path.dirname('https://example.com/stuff/source.html')
>>> dir
'https://example.com/stuff'

and then join the two parts together:

>>> os.path.join(dir, "foo.html")
'https://example.com/stuff/foo.html'
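One caveat: os.path is meant for filesystem paths, so os.path.join uses backslashes on Windows and does not understand hrefs such as "../bar.html" or "/top.html". As a hedged alternative, the sketch below uses urllib.parse.urljoin from the standard library, which applies URL resolution rules. The resolve helper and the sample hrefs are illustrative names and values, not part of the original answer.

```python
from urllib.parse import urljoin

# Same example page as above.
base = "https://example.com/stuff/source.html"


def resolve(href):
    """Hypothetical helper: resolve an href against the page it was found on."""
    return urljoin(base, href)


# Illustrative hrefs covering the common forms found in HTML pages.
hrefs = [
    "foo.html",                          # relative to the page's directory
    "../other/bar.html",                 # climbs one directory
    "/top.html",                         # relative to the site root
    "https://other.example.org/x.html",  # already absolute, left unchanged
    "#section",                          # fragment on the same page
]

resolved = {href: resolve(href) for href in hrefs}
for href, url in resolved.items():
    print(f"{href!r:40} -> {url}")
```

Passing the URL of the page that was actually fetched as the base keeps the result correct even when the href climbs directories or is already absolute.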

Similar to what's described here. I believe you're actually going to need some kind of webdriver automation tool (Selenium, etc.) to simulate the hover-over and get the data.
