I am trying to extract data from this link with Scrapy. I want to loop through these URLs with page=1 through roughly the top 100 pages and extract every instance of <a href="/@eberhardgross">, for example. Ultimately I am just trying to grab the username there, but there are other <a href=""> elements on the page. If I could extract just the usernames that would be great, but if I have to get all <a href=""> elements that's fine too; I can sort them afterwards and keep just the ones with an @. Just wondering if I can do this with Scrapy?
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        url = "https://www.pexels.com/leaderboard/all-time.js?format=js&seed=&page=%(page_number)s&type="
        page_to_crawl = 100
        for page_number in range(page_to_crawl):
            yield scrapy.Request(url % {'page_number': page_number}, self.parse)

    def parse(self, response):
        usernames = response.xpath('//a[contains(@href, "@")]/@href').getall()
To crawl several pages, you can use start_requests to iterate over the pages:
def start_requests(self):
    url = "https://www.pexels.com/leaderboard/all-time.js?format=js&seed=&page=%(page_number)s&type="
    page_to_crawl = 100
    # pages on the site start at 1, so iterate 1..100 rather than 0..99
    for page_number in range(1, page_to_crawl + 1):
        yield scrapy.Request(url % {'page_number': page_number}, self.parse)
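As a quick sanity check outside Scrapy, the %-style mapping format above expands to one concrete URL per page. A minimal sketch using the same template string (page numbers assumed to start at 1):

```python
# The %(page_number)s placeholder is filled from a dict,
# producing one leaderboard URL per page.
url = "https://www.pexels.com/leaderboard/all-time.js?format=js&seed=&page=%(page_number)s&type="

urls = [url % {'page_number': n} for n in range(1, 4)]
for u in urls:
    print(u)
```

Each printed URL differs only in its page= value, which is what the spider relies on to walk the leaderboard.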
And in your parse method, you can use XPath to get every href that contains an @:
def parse(self, response):
    usernames = response.xpath('//a[contains(@href, "@")]/@href').getall()
    yield {
        'usernames': usernames,
    }
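Since the hrefs come back in the form "/@eberhardgross", a small post-processing step can reduce them to bare usernames. A sketch on hypothetical sample data, assuming profile links always start with "/@":

```python
# Hypothetical sample of hrefs as the XPath above might return them.
hrefs = ["/@eberhardgross", "/photo/12345/", "/@anotheruser"]

# Keep only profile links and strip the leading "/@" to get the username.
usernames = [h[len("/@"):] for h in hrefs if h.startswith("/@")]
print(usernames)  # -> ['eberhardgross', 'anotheruser']
```

You could apply the same filter inside parse before yielding, so the item contains usernames instead of raw hrefs.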