Callback Function never called using Scrapy

I am new to Scrapy and Python. I have spent several hours debugging and looking for helpful answers, but I am still stuck. I am trying to extract data from www.pro-football-reference.com. This is the code I have right now:

import scrapy

from nfl_predictor.items import NflPredictorItem


class NflSpider(scrapy.Spider):
    name = "nfl2"
    # Domain names only, without scheme or path
    allowed_domains = ["pro-football-reference.com"]
    start_urls = [
        "http://www.pro-football-reference.com/boxscores/201509100nwe.htm"
    ]

    def parse(self, response):
        print("parse")
        # Follow each link found in the page_content table
        for href in response.xpath('//*[@id="page_content"]/div[1]/table/tr/td/a/@href'):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_game_content)

    def parse_game_content(self, response):
        print("parse_game_content")
        items = []
        # One item per row of the team_stats table
        for sel in response.xpath('//table[@id = "team_stats"]/tr'):
            item = NflPredictorItem()
            item['away_stats'] = sel.xpath('td[@align = "center"][1]/text()').extract()
            item['home_stats'] = sel.xpath('td[@align = "center"][2]/text()').extract()
            items.append(item)
        return items

I used the parse command for debugging. With this command

scrapy parse --spider=nfl2 "http://www.pro-football-reference.com/boxscores/201509100nwe.htm"

I get the following output:

>>> STATUS DEPTH LEVEL 1 <<<
# Scraped Items  ------------------------------------------------------------
[]

# Requests  -----------------------------------------------------------------
[<GET http://www.pro-football-reference.com/years/2015/games.htm>,
 <GET http://www.nfl.com/scores/2015/REG1>,
 <GET http://www.pro-football-reference.com/boxscores/201509130buf.htm>,
 <GET http://www.pro-football-reference.com/boxscores/201509130chi.htm>,
 <GET http://www.pro-football-reference.com/boxscores/201509130crd.htm>,
 <GET http://www.pro-football-reference.com/boxscores/201509130dal.htm>,
 <GET http://www.pro-football-reference.com/boxscores/201509130den.htm>,
 <GET http://www.pro-football-reference.com/boxscores/201509130htx.htm>,
 <GET http://www.pro-football-reference.com/boxscores/201509130jax.htm>,
 <GET http://www.pro-football-reference.com/boxscores/201509130nyj.htm>,
 <GET http://www.pro-football-reference.com/boxscores/201509130rai.htm>,
 <GET http://www.pro-football-reference.com/boxscores/201509130ram.htm>,
 <GET http://www.pro-football-reference.com/boxscores/201509130sdg.htm>,
 <GET http://www.pro-football-reference.com/boxscores/201509130tam.htm>,
 <GET http://www.pro-football-reference.com/boxscores/201509130was.htm>,
 <GET http://www.pro-football-reference.com/boxscores/201509140atl.htm>,
 <GET http://www.pro-football-reference.com/boxscores/201509140sfo.htm>]

Why is it that it is logging the requests for the links I want, but it is never entering the parse_game_content function to actually scrape the data? I have also tested the parse_game_content function as the parse function to make sure it is scraping the right data and it works properly in that case.

Thank you for your help!

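By default, the parse command fetches the given URL and parses it with the spider that handles it, using the method passed with the --callback option, or parse if none is given. In your case it runs only the parse method: the requests that parse yields are listed in the output but are not followed, so parse_game_content is never called. To run that method on the page instead, pass it with --callback, like this: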

scrapy parse --spider=nfl2 "http://www.pro-football-reference.com/boxscores/201509100nwe.htm" --callback=parse_game_content
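
To make the behaviour described above concrete, here is a toy model in plain Python: one callback is run on one response, and any requests it produces are only collected, never followed. ToySpider, run_parse_once, and the tuple-based "requests" are hypothetical stand-ins for illustration; this is not Scrapy's API or its actual implementation.

```python
class ToySpider:
    """Hypothetical stand-in for a spider; it records which callbacks ran."""

    def __init__(self):
        self.called = []

    def parse(self, response):
        self.called.append("parse")
        # In this toy model a "request" is just a tagged tuple.
        for url in response["links"]:
            yield ("request", url, self.parse_game_content)

    def parse_game_content(self, response):
        self.called.append("parse_game_content")
        yield {"page": response["url"]}


def run_parse_once(spider, response, callback="parse"):
    """Run a single callback on a single response, at depth 1 only.

    Requests are collected for display but never followed.
    """
    items, requests = [], []
    for result in getattr(spider, callback)(response):
        if isinstance(result, tuple) and result[0] == "request":
            requests.append(result[1])
        else:
            items.append(result)
    return items, requests


# Made-up response data for the sketch.
response = {"url": "week1.htm", "links": ["game1.htm", "game2.htm"]}

# Default callback: requests are listed, but parse_game_content never runs.
spider = ToySpider()
items, requests = run_parse_once(spider, response)
print(items, requests, spider.called)

# Explicit callback: parse_game_content runs directly on the given page.
spider = ToySpider()
items, requests = run_parse_once(spider, response, callback="parse_game_content")
print(items, requests, spider.called)
```

In the first run the scraped items are empty and only the requests are listed, which mirrors the output in the question; in the second run the named callback is the one that executes.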

It is also better to rewrite parse_game_content as a generator that yields each item as soon as it is built:

    def parse_game_content(self, response):
        print("parse_game_content")
        for sel in response.xpath('//table[@id="team_stats"]/tr'):
            item = NflPredictorItem()
            item['away_stats'] = sel.xpath('td[@align = "center"][1]/text()').extract()
            item['home_stats'] = sel.xpath('td[@align = "center"][2]/text()').extract()
            yield item
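
A callback can hand its items back either as a list it builds up or one at a time with yield. The sketch below contrasts the two forms on made-up rows, using plain dicts instead of NflPredictorItem; the function names and the sample data are assumptions for illustration only. The generator form has no list to manage, so there is no append or return that can end up at the wrong indentation level.

```python
# Made-up stand-in for table rows: (label, away value, home value).
ROWS = [
    ("row1", "a1", "h1"),
    ("row2", "a2", "h2"),
    ("row3", "a3", "h3"),
]


def stats_as_list(rows):
    """Build the whole list first, then return it."""
    items = []
    for label, away, home in rows:
        item = {"stat": label, "away_stats": away, "home_stats": home}
        items.append(item)  # must stay inside the loop
    return items  # must stay outside the loop


def stats_as_generator(rows):
    """Yield each item as soon as it is built."""
    for label, away, home in rows:
        yield {"stat": label, "away_stats": away, "home_stats": home}


# Both forms produce the same items in the same order.
same = list(stats_as_generator(ROWS)) == stats_as_list(ROWS)
print(same)
```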
