
How to extract all links (href + text) from a page with Scrapy

I need to extract all the links on a page, and from each link I need to get the href and its respective text.

Suppose a page has a total of 3 links:

<a href="https://www.stackoverflow.com">This is the Stackoverflow page</a>
<a href="https://example.com">This is an example link</a>
<a href="tel:+99999999999">This is my phone</a>

I would need a result like this:

links = {
    "https://www.stackoverflow.com": "This is the Stackoverflow page",
    "https://example.com": "This is an example link",
    "tel:+99999999999": "This is my phone"
}

So the goal is to know which text belongs to which href. The page is not a specific one; it can be any page.

I've tried two other ways to no avail:

  1. Returns only the href:

     for r in response.css('a::attr(href)').getall():
         print(r)
  2. Does not return the href, only the text:

     le = LinkExtractor()
     for link in le.extract_links(response):
         print(link.url)
         print(link.text)

And it needs to be done with Scrapy; BeautifulSoup doesn't fit.

To keep with the format you posted:

links = {}
for r in response.css('a'):
    url = r.css('::attr(href)').get()
    txt = r.css('::text').get()
    links[url] = txt

response.css('a') will return a list of selectors.

r will be a different selector in each iteration of the for loop.

Since r is a selector, you can use the .css() (or .xpath()) method to access any path or attribute of that node. In this case, the text and the href.
