How can I jump to the next page in Scrapy?

I'm trying to scrape the results from here using Scrapy. The problem is that not all of the courses appear on the page until 'load more results' is clicked.

The problem can be seen here:

[screenshot]

My code looks like this:

class ClassCentralSpider(CrawlSpider):
    name = "class_central"
    allowed_domains = ["www.class-central.com"]
    start_urls = (
        'https://www.class-central.com/courses/recentlyAdded',
    )
    rules = (
        Rule(
            LinkExtractor(
                # allow=("index\d00\.html",),
                restrict_xpaths=('//div[@id="show-more-courses"]',)
            ),
            callback='parse',
            follow=True
        ),
    )

    def parse(self, response):
        x = response.xpath('//span[@class="course-name-text"]/text()').extract()
        item = ClasscentralItem()
        for y in x:
            item['name'] = y
            print(item['name'])

The second page of this website seems to be generated via an AJAX call. If you look at the network tab of any browser inspection tool, you'll see something like:

[screenshot: Firebug network tab]

In this case it seems to be retrieving a JSON document from https://www.class-central.com/maestro/courses/recentlyAdded?page=2&_=1469471093134

The URL parameter _=1469471093134 seems to do nothing, so you can trim it off: https://www.class-central.com/maestro/courses/recentlyAdded?page=2
The returned JSON contains the HTML code for the next page:

import json
from scrapy import Selector

# so you just need to load the JSON body
data = json.loads(response.body)
# and wrap the HTML stored under 'table' in a Scrapy selector
sel = Selector(text=data['table'])
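
To make the shape of that response concrete, here is a minimal standard-library sketch of the same idea: decode the JSON, then parse the HTML held in its 'table' key. It uses html.parser as a stand-in for Scrapy's Selector so it runs without Scrapy installed. The sample payload and the names CourseNameParser and course_names_from_json are made up for illustration; the real response may contain other keys and different markup.

```python
import json
from html.parser import HTMLParser


class CourseNameParser(HTMLParser):
    """Collect the text of every <span class="course-name-text"> element."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._in_name = False

    def handle_starttag(self, tag, attrs):
        if tag == "span" and dict(attrs).get("class") == "course-name-text":
            self._in_name = True
            self.names.append("")

    def handle_data(self, data):
        if self._in_name:
            self.names[-1] += data

    def handle_endtag(self, tag):
        # Simplification: assumes no nested <span> inside a course name.
        if tag == "span":
            self._in_name = False


def course_names_from_json(body):
    """Decode the JSON body and parse the HTML stored under 'table'."""
    data = json.loads(body)
    parser = CourseNameParser()
    parser.feed(data["table"])
    parser.close()
    return [name.strip() for name in parser.names]


# Hypothetical payload, for illustration only.
sample_body = json.dumps({
    "table": '<tr><td><span class="course-name-text">Course A</span></td></tr>'
             '<tr><td><span class="course-name-text">Course B</span></td></tr>'
})
print(course_names_from_json(sample_body))
```

In the spider itself, Selector(text=data['table']) with the XPath from the question does the same job in far fewer lines.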

To replicate this in your code, try something like:

import json

from scrapy import Request, Selector
from w3lib.url import add_or_replace_parameter

# JSON endpoint that serves page 2 onwards
NEXT_PAGE_URL = 'https://www.class-central.com/maestro/courses/recentlyAdded'

# note: put this in a plain scrapy.Spider; CrawlSpider uses parse() internally for its rules
def parse(self, response):
    # check if response is json, if so convert to selector
    if response.meta.get('is_json', False):
        # the JSON carries the page HTML under the 'table' key
        sel = Selector(text=json.loads(response.body)['table'])
    else:
        sel = Selector(response)
    # parse page here for items
    for name in sel.xpath('//span[@class="course-name-text"]/text()').extract():
        item = ClasscentralItem()
        item['name'] = name
        yield item
    # do next page (assumes the JSON fragment also includes this element
    # while more pages remain)
    next_page_el = sel.xpath("//div[@id='show-more-courses']")
    if next_page_el:  # there is a next page
        next_page = response.meta.get('page', 1) + 1
        # make next page url
        url = add_or_replace_parameter(NEXT_PAGE_URL, 'page', str(next_page))
        yield Request(url, self.parse, meta={'page': next_page, 'is_json': True})
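
As a rough illustration of the URL handling above, the sketch below builds the next-page URL with only the standard library: it sets the page parameter and drops the _ parameter seen in the network tab. The helper set_query_param is a hypothetical stand-in written for this example, not part of w3lib or Scrapy; in the spider, add_or_replace_parameter covers the page update.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def set_query_param(url, name, value, drop=()):
    """Return url with query parameter `name` set to `value`, minus any `drop` keys."""
    parts = urlsplit(url)
    # Keep every existing parameter except the one being replaced and the dropped ones.
    params = [
        (key, val)
        for key, val in parse_qsl(parts.query, keep_blank_values=True)
        if key != name and key not in drop
    ]
    params.append((name, str(value)))
    return urlunsplit(parts._replace(query=urlencode(params)))


# URL as observed in the browser's network tab
observed = "https://www.class-central.com/maestro/courses/recentlyAdded?page=2&_=1469471093134"
print(set_query_param(observed, "page", 3, drop=("_",)))
```

Passing the value through str() keeps the helper indifferent to whether the page counter is an int or a string.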
