I need to extract all the links on a page, and from each link I need to get the href and its respective text.
Say a page has a total of 3 links:
<a href="https://www.stackoverflow.com">This is the Stackoverflow page</a>
<a href="https://example.com">This is an example link</a>
<a href="tel:+99999999999">This is my phone</a>
I would need a result like this:
links = {
"https://www.stackoverflow.com": "This is the Stackoverflow page",
"https://example.com": "This is an example link",
"tel:+99999999999": "This is my phone"
}
So the goal is to know that text X belongs to href Y. The page itself doesn't matter; it can be any one.
I've tried two other ways to no avail:
Returns only the href:
for r in response.css('a::attr(href)').getall():
    print(r)
Does not return the href, only the text:
le = LinkExtractor()
for link in le.extract_links(response):
    print(link.url)
    print(link.text)
And it needs to be with Scrapy, BeautifulSoup doesn't fit.
To keep with the format you posted:
for r in response.css('a'):
url = r.css('::attr(href)').get()
txt = r.css('::text').get()
response.css('a') returns a list of selectors, so r is a different selector on each iteration of the for loop.
Since r is a selector, you can use its .css() (or .xpath()) method to access any path or attribute of that node — in this case, the text and the href.