
Can I extract this XHR data with Scrapy?

I am trying to extract data from this link with Scrapy. I want to loop through these URLs from page=1 up through roughly the top 100 pages and extract every instance of, for example, `<a href="/@eberhardgross">`. Ultimately I just want to grab the username there, but there are other `<a href="">` elements on the page. If I could extract just the usernames that would be great, but if I have to get every `<a href="">` that's fine too; I can sort them and keep just the ones with an @. Just wondering if I can do this via Scrapy?

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        url = "https://www.pexels.com/leaderboard/all-time.js?format=js&seed=&page=%(page_number)s&type="
        pages_to_crawl = 100
        # Pages start at 1, so iterate 1..100 rather than 0..99
        for page_number in range(1, pages_to_crawl + 1):
            yield scrapy.Request(url % {'page_number': page_number}, self.parse)

    def parse(self, response):
        usernames = response.xpath('//a[contains(@href, "@")]/@href').getall()
        yield {'usernames': usernames}

To crawl several pages, you can use start_requests to iterate over the pages:

def start_requests(self):
    url = "https://www.pexels.com/leaderboard/all-time.js?format=js&seed=&page=%(page_number)s&type="
    pages_to_crawl = 100
    for page_number in range(1, pages_to_crawl + 1):
        yield scrapy.Request(url % {'page_number': page_number}, self.parse)
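As a quick sanity check, the `%`-style mapping substitution builds one URL per page; this can be verified in plain Python without Scrapy:

```python
url = "https://www.pexels.com/leaderboard/all-time.js?format=js&seed=&page=%(page_number)s&type="

# Fill the named placeholder from a dict, the same way the Request URL is built
print(url % {"page_number": 1})
# https://www.pexels.com/leaderboard/all-time.js?format=js&seed=&page=1&type=
```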

And in your parse method you can get the hrefs that contain @ with an XPath expression:

def parse(self, response):
    usernames = response.xpath('//a[contains(@href, "@")]/@href').getall()
    yield {
        'usernames': usernames
    }
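If you only want the bare usernames, you can filter the extracted hrefs down to profile links and strip the `/@` prefix. A minimal sketch; the `extract_usernames` helper name is my own, and the sample hrefs are made up:

```python
def extract_usernames(hrefs):
    """Keep only profile links like '/@eberhardgross' and return bare usernames."""
    return [h.split("@", 1)[1] for h in hrefs if h.startswith("/@")]

# Sample data in the shape getall() returns
hrefs = ["/@eberhardgross", "/photo/12345/", "/@jane-doe", "/license"]
print(extract_usernames(hrefs))  # ['eberhardgross', 'jane-doe']
```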
