Callback Function never called using Scrapy

Question

I am new to Scrapy and Python. I have spent several hours debugging and searching for helpful answers, but I am still stuck. I am trying to extract data from www.pro-football-reference.com. This is the code I have right now:

import scrapy

from nfl_predictor.items import NflPredictorItem

class NflSpider(scrapy.Spider):
    name = "nfl2"
    allowed_domains = ["http://www.pro-football-reference.com/"]
    start_url = [
        "http://www.pro-football-reference.com/boxscores/201509100nwe.htm"
    ]

    def parse(self, response):
        print "parse"
        for href in response.xpath('//*[@id="page_content"]/div[1]/table/tr/td/a/@href'):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_game_content)

    def parse_game_content(self, response):
        print "parse_game_content"
        items = []
        for sel in response.xpath('//table[@id = "team_stats"]/tr'):
            item = NflPredictorItem()
            item['away_stats'] = sel.xpath('td[@align = "center"][1]/text()').extract()
            item['home_stats'] = sel.xpath('td[@align = "center"][2]/text()').extract()
            items.append(item)
        return items

I used the parse command for debugging. With this command:

scrapy parse --spider=nfl2 "http://www.pro-football-reference.com/boxscores/201509100nwe.htm"

I get the following output:

>>> STATUS DEPTH LEVEL 1 <<<
# Scraped Items  ------------------------------------------------------------
[]

# Requests  -----------------------------------------------------------------
[<GET http://www.pro-football-reference.com/years/2015/games.htm>,
 <GET http://www.nfl.com/scores/2015/REG1>,
 <GET http://www.pro-football-reference.com/boxscores/201509130buf.htm>,
 <GET http://www.pro-football-reference.com/boxscores/201509130chi.htm>,
 <GET http://www.pro-football-reference.com/boxscores/201509130crd.htm>,
 <GET http://www.pro-football-reference.com/boxscores/201509130dal.htm>,
 <GET http://www.pro-football-reference.com/boxscores/201509130den.htm>,
 <GET http://www.pro-football-reference.com/boxscores/201509130htx.htm>,
 <GET http://www.pro-football-reference.com/boxscores/201509130jax.htm>,
 <GET http://www.pro-football-reference.com/boxscores/201509130nyj.htm>,
 <GET http://www.pro-football-reference.com/boxscores/201509130rai.htm>,
 <GET http://www.pro-football-reference.com/boxscores/201509130ram.htm>,
 <GET http://www.pro-football-reference.com/boxscores/201509130sdg.htm>,
 <GET http://www.pro-football-reference.com/boxscores/201509130tam.htm>,
 <GET http://www.pro-football-reference.com/boxscores/201509130was.htm>,
 <GET http://www.pro-football-reference.com/boxscores/201509140atl.htm>,
 <GET http://www.pro-football-reference.com/boxscores/201509140sfo.htm>]

Why does it log the requests for the links I want but never enter the parse_game_content function to actually scrape the data? I have also tested parse_game_content as the parse function to make sure it scrapes the right data, and it works properly in that case.

Thank you for your help!

Answer 1

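By default, the parse command fetches the given URL and parses it with the spider that handles it, using the method passed with the --callback option, or parse if none is given. In your case it runs only the parse function: the requests that parse yields are listed but not followed, because the command's --depth option defaults to 1. To test parse_game_content directly, pass it as --callback: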

scrapy parse --spider=nfl2 "http://www.pro-football-reference.com/boxscores/201509100nwe.htm" --callback=parse_game_content
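
To make the behaviour concrete, here is a toy model of what the command does at depth 1. It is a simplified sketch in plain Python 3, not Scrapy's actual implementation: ToySpider, run_parse_command, and the tuples and dicts standing in for requests, items, and responses are all hypothetical.

```python
class ToySpider:
    """Hypothetical stand-in for a spider; no Scrapy required."""

    def parse(self, response):
        # The listing callback only produces follow-up requests.
        for url in response["links"]:
            yield ("request", url)

    def parse_game_content(self, response):
        # The detail callback is the one that produces items.
        yield ("item", {"page": response["url"]})


def run_parse_command(spider, response, callback=None):
    """Run a single callback (default: 'parse') and sort its output.

    Requests are collected for display but never followed,
    which is what a depth of 1 amounts to in this model.
    """
    method = getattr(spider, callback or "parse")
    items, requests = [], []
    for kind, payload in method(response):
        if kind == "item":
            items.append(payload)
        else:
            requests.append(payload)
    return items, requests


# Placeholder response data, for illustration only.
response = {"url": "boxscore.htm", "links": ["a.htm", "b.htm"]}
spider = ToySpider()

# No callback given: only parse runs, so there are requests but no items.
default_items, default_requests = run_parse_command(spider, response)

# Callback named explicitly: the item-producing method runs instead.
chosen_items, chosen_requests = run_parse_command(
    spider, response, callback="parse_game_content"
)
```

In this model the default run returns an empty item list and two requests, which matches the shape of the output in the question, while the explicit callback returns one item and no requests.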

It is also better to write parse_game_content as a generator that yields each item:

    def parse_game_content(self, response):
        print "parse_game_content"
        for sel in response.xpath('//table[@id="team_stats"]/tr'):
            item = NflPredictorItem()
            item['away_stats'] = sel.xpath('td[@align = "center"][1]/text()').extract()
            item['home_stats'] = sel.xpath('td[@align = "center"][2]/text()').extract()
            yield item
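
A callback may either return an iterable of items or yield them one at a time. The sketch below compares the two styles in plain Python 3, assuming dicts as hypothetical stand-ins for NflPredictorItem and placeholder strings for the scraped cells. Both produce the same items; the generator simply has no list to build and no append call that can end up outside the loop.

```python
def collect_then_return(rows):
    """List-building style: items are gathered and returned at the end."""
    items = []
    for away, home in rows:
        item = {"away_stats": away, "home_stats": home}
        items.append(item)  # must stay inside the loop
    return items


def yield_each(rows):
    """Generator style: each item is handed over as soon as it is built."""
    for away, home in rows:
        yield {"away_stats": away, "home_stats": home}


# Placeholder rows, not real statistics.
rows = [("away-1", "home-1"), ("away-2", "home-2")]

returned = collect_then_return(rows)
yielded = list(yield_each(rows))
```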
