Scrapy — Scraping a page and scraping next pages
I am trying to scrape RateMyProfessors for the professor statistics defined in my items.py file:
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html
from scrapy.item import Item, Field

class ScraperItem(Item):
    # define the fields for your item here:
    numOfPages = Field()       # number of pages of professors (usually 476)
    firstMiddleName = Field()  # first (and middle) name
    lastName = Field()         # last name
    numOfRatings = Field()     # number of ratings
    overallQuality = Field()   # numerical rating
    averageGrade = Field()     # letter grade
    profile = Field()          # url of professor profile
Here is my scraper_spider.py file:
import scrapy
from scraper.items import ScraperItem
from scrapy.contrib.spiders import Rule
from scrapy.contrib.linkextractors import LinkExtractor

class scraperSpider(scrapy.Spider):
    name = "scraper"
    allowed_domains = ["www.ratemyprofessors.com"]
    start_urls = [
        "http://www.ratemyprofessors.com/search.jsp?queryBy=teacherName&schoolName=pennsylvania+state+university"
    ]

    rules = (
        Rule(LinkExtractor(restrict_xpaths=('//a[@class="nextLink"]')),
             callback='parse', follow=True),
    )

    def parse(self, response):
        numOfPages = int(response.xpath('((//a[@class="step"])[last()])/text()').extract()[0])
        # create a list of profile links
        profiles = response.xpath('//li[@class="listing PROFESSOR"]/a/@href').extract()
        # for each of those links
        for profile in profiles:
            # define an item and attach the profile link to it
            professor = ScraperItem()
            professor["profile"] = profile
            # hand each profile page to parse_profile(), carrying the item in meta
            request = scrapy.Request("http://www.ratemyprofessors.com" + profile,
                                     callback=self.parse_profile)
            request.meta["professor"] = professor
            yield request

    def parse_profile(self, response):
        professor = response.meta["professor"]
        # scrape each field from the profile page, guarding against missing elements
        if response.xpath('//*[@class="pfname"]'):
            professor["firstMiddleName"] = response.xpath('//h1[@class="profname"]/span[@class="pfname"][1]/text()').extract()
        if response.xpath('//*[@class="plname"]'):
            professor["lastName"] = response.xpath('//h1[@class="profname"]/span[@class="plname"]/text()').extract()
        if response.xpath('//*[@class="table-toggle rating-count active"]'):
            professor["numOfRatings"] = response.xpath('//div[@class="table-toggle rating-count active"]/text()').extract()
        if response.xpath('//*[@class="grade"]'):
            professor["overallQuality"] = response.xpath('//div[@class="breakdown-wrapper"]/div[@class="breakdown-header"][1]/div[@class="grade"]/text()').extract()
        if response.xpath('//*[@class="grade"]'):
            professor["averageGrade"] = response.xpath('//div[@class="breakdown-wrapper"]/div[@class="breakdown-header"][2]/div[@class="grade"]/text()').extract()
        return professor

# note: prepend the domain string to each rule link — the link extractor
# only yields "/showratings..." not "ratemyprofessors.com/showratings"
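The trailing comment above touches a real pitfall: the link extractor yields root-relative hrefs such as "/ShowRatings.jsp?...", while scrapy.Request wants absolute URLs. String concatenation works, but the standard library's urljoin handles both relative and absolute hrefs safely. A small sketch (the tid value is a made-up example id, not one from the site):

```python
from urllib.parse import urljoin

base = "http://www.ratemyprofessors.com/search.jsp?queryBy=teacherName"

# a root-relative href replaces the path on the same host
print(urljoin(base, "/ShowRatings.jsp?tid=1234"))
# -> http://www.ratemyprofessors.com/ShowRatings.jsp?tid=1234

# an already-absolute href passes through unchanged
print(urljoin(base, "http://www.ratemyprofessors.com/ShowRatings.jsp?tid=1234"))
# -> http://www.ratemyprofessors.com/ShowRatings.jsp?tid=1234
```

Recent Scrapy versions expose the same behavior as `response.urljoin(href)`, which also covers pages served from a subdirectory.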
My issue lies in the scraper_spider.py file above. The spider is supposed to go to this RateMyProfessors page, visit each individual professor and grab the info, then go back to the directory and get the next professor's info. Once there are no more professors left on the page to scrape, it should find the href value of the next button, go to that page, and repeat the same process.

My scraper is able to scrape all the professors on page 1 of the directory, but it stops there because it never moves on to the next page.

Can you help my scraper successfully find and follow the next page?

I tried to follow this StackOverflow question, but it was too specific to be of use.
Your scraperSpider should inherit from CrawlSpider if you want to use the rules attribute. See the docs here. Also be aware of this warning from the docs:

When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.
I solved my problem by ignoring the rules entirely and following the "Following links" section of this documentation.