
How to extract all links (href + text) from a page with Scrapy

I need to extract all the links on a page, and from each link I need to get the href and its respective text.

Suppose a page has a total of 3 links:

<a href="https://www.stackoverflow.com">This is the Stackoverflow page</a>
<a href="https://example.com">This is an example link</a>
<a href="tel:+99999999999">This is my phone</a>

I would need a result like this:

links = {
    "https://www.stackoverflow.com": "This is the Stackoverflow page",
    "https://example.com": "This is an example link",
    "tel:+99999999999": "This is my phone"
}

So the goal is to know which text belongs to which href. The page is not a specific one; it can be any page.

I've tried two other ways to no avail:

  1. Returns only the href:

     for r in response.css('a::attr(href)').getall():
         print(r)
  2. Does not return the href, only the text:

     le = LinkExtractor()
     for link in le.extract_links(response):
         print(link.url)
         print(link.text)

And it needs to be done with Scrapy; BeautifulSoup doesn't fit.

To keep with the format you posted:

links = {}
for r in response.css('a'):
    url = r.css('::attr(href)').get()
    txt = r.css('::text').get()
    links[url] = txt

response.css('a') will return a list of selectors.

r will be a different selector in each iteration of the for loop.

Since r is a selector, you can use the .css() (or .xpath()) method to access any path or attribute of that node. In this case, the text and the href.
