How to extract all links (href + text) from a page with Scrapy

Question

I need to extract every link on a page and, for each link, get its href together with its text.

Suppose a page has three links in total:

<a href="https://www.stackoverflow.com">This is the Stackoverflow page</a>
<a href="https://example.com">This is an example link</a>
<a href="tel:+99999999999">This is my phone</a>

I need a result like this:

links = {
    "https://www.stackoverflow.com": "This is the Stackoverflow page",
    "https://example.com": "This is an example link",
    "tel:+99999999999": "This is my phone"
}

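So the goal is to know that text X belongs to href Y. The page is not a specific one; it can be any page.

I tried two other approaches, to no avail:

This returns only the href: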

for r in response.css('a::attr(href)').getall():
    print(r)

This does not return the href, only the text:

from scrapy.linkextractors import LinkExtractor

le = LinkExtractor()
for link in le.extract_links(response):
    print(link.url)
    print(link.text)

It needs to work with Scrapy; BeautifulSoup is not suitable.

Answer 1

To keep the format you posted:

for r in response.css('a'):
    url = r.css('::attr(href)').get()  # value of the href attribute
    txt = r.css('::text').get()        # first text node inside the link

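response.css('a') returns a list of selectors.

On each iteration of the for loop, r is a different selector.

Because r is a selector, you can use its .css() (or .xpath()) method to reach any path or attribute of that node; in this case, the text and the href.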

Solution 1
1 accepted 2020-07-26 23:19:39
