
Scrapy — Scraping a page and scraping next pages

I am trying to scrape RateMyProfessors for the professor statistics defined in my items.py file:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

from scrapy.item import Item, Field


class ScraperItem(Item):
    # define the fields for your item here like:
    numOfPages = Field() # number of pages of professors (usually 476)

    firstMiddleName = Field() # first (and middle) name
    lastName = Field() # last name
    numOfRatings = Field() # number of ratings
    overallQuality = Field() # numerical rating
    averageGrade = Field() # letter grade
    profile = Field() # url of professor profile


Here is my scraper_spider.py file:

import scrapy

from scraper.items import ScraperItem
from scrapy.contrib.spiders import Rule
from scrapy.contrib.linkextractors import LinkExtractor


class scraperSpider(scrapy.Spider):
    name = "scraper"
    allowed_domains = ["www.ratemyprofessors.com"]
    start_urls = [
    "http://www.ratemyprofessors.com/search.jsp?queryBy=teacherName&schoolName=pennsylvania+state+university"
    ]

    rules = (
        Rule(LinkExtractor(restrict_xpaths=('//a[@class="nextLink"]')),callback='parse',follow=True),
        )

    def parse(self, response):
        # professors = []
        numOfPages = int(response.xpath('((//a[@class="step"])[last()])/text()').extract()[0])

        # create array of profile links
        profiles = response.xpath('//li[@class="listing PROFESSOR"]/a/@href').extract()

        # for each of those links
        for profile in profiles:
            # define item
            professor = ScraperItem()

            # add profile to professor
            professor["profile"] = profile

            # pass each page to the parse_profile() method
            request = scrapy.Request("http://www.ratemyprofessors.com"+profile,
                 callback=self.parse_profile)
            request.meta["professor"] = professor

            # add professor to array of professors
            yield request


    def parse_profile(self, response):
        professor = response.meta["professor"]

        if response.xpath('//*[@class="pfname"]'):
            # scrape each item from the link that was passed as an argument and add to current professor
            professor["firstMiddleName"] = response.xpath('//h1[@class="profname"]/span[@class="pfname"][1]/text()').extract() 

        if response.xpath('//*[@class="plname"]'):
            professor["lastName"] = response.xpath('//h1[@class="profname"]/span[@class="plname"]/text()').extract()

        if response.xpath('//*[@class="table-toggle rating-count active"]'):
            professor["numOfRatings"] = response.xpath('//div[@class="table-toggle rating-count active"]/text()').extract()

        if response.xpath('//*[@class="grade"]'):
            professor["overallQuality"] = response.xpath('//div[@class="breakdown-wrapper"]/div[@class="breakdown-header"][1]/div[@class="grade"]/text()').extract()

        if response.xpath('//*[@class="grade"]'):
            professor["averageGrade"] = response.xpath('//div[@class="breakdown-wrapper"]/div[@class="breakdown-header"][2]/div[@class="grade"]/text()').extract()

        return professor

# add string to rule.  linkextractor only gets "/showratings.." not "ratemyprofessors.com/showratings"
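The note above is about relative links: the XPath extracts hrefs like "/ShowRatings..", which must be joined onto the site's base URL before they can be requested. The string concatenation in parse() does this by hand; the standard-library urljoin (which is also what response.urljoin() calls under the hood in Scrapy) does it more robustly. A small sketch, where the tid value is a made-up placeholder:

```python
from urllib.parse import urljoin

# The listing page's URL serves as the base for resolving relative hrefs.
base = "http://www.ratemyprofessors.com/search.jsp?queryBy=teacherName"

# A profile href extracted by the XPath above is site-relative, e.g.:
href = "/ShowRatings.jsp?tid=12345"

# urljoin resolves the relative path against the base URL.
full_url = urljoin(base, href)
print(full_url)  # http://www.ratemyprofessors.com/ShowRatings.jsp?tid=12345
```

Because the href starts with "/", urljoin replaces the base URL's path and query entirely, which is exactly the behavior the manual concatenation relies on.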

My issue lies in the scraper_spider.py file above. The spider is supposed to go to this RateMyProfessors page, visit each individual professor's profile to grab the info, then return to the directory and get the next professor's info. Once there are no professors left on the page to scrape, it should find the href value of the next button, go to that page, and repeat the same process.

My scraper is able to scrape all the professors on page 1 of the directory, but it stops there because it never moves on to the next page.

Can you help my scraper successfully find and follow the next page?

I tried to follow this StackOverflow question, but it was too specific to be of use.

Your scraperSpider should inherit from CrawlSpider if you want to use the rules attribute. See the docs here. Also be aware of this warning from the docs:

When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.

I solved my problem by ignoring the rules entirely and following the "Following links" section of this documentation.
