I am new to Scrapy and Python. I have spent several hours debugging and looking for helpful responses, but I am still stuck. I am trying to extract data from www.pro-football-reference.com. This is the code I have right now:
import scrapy
from nfl_predictor.items import NflPredictorItem

class NflSpider(scrapy.Spider):
    name = "nfl2"
    allowed_domains = ["http://www.pro-football-reference.com/"]
    start_url = [
        "http://www.pro-football-reference.com/boxscores/201509100nwe.htm"
    ]

    def parse(self, response):
        print "parse"
        for href in response.xpath('//*[@id="page_content"]/div[1]/table/tr/td/a/@href'):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_game_content)

    def parse_game_content(self, response):
        print "parse_game_content"
        items = []
        for sel in response.xpath('//table[@id = "team_stats"]/tr'):
            item = NflPredictorItem()
            item['away_stats'] = sel.xpath('td[@align = "center"][1]/text()').extract()
            item['home_stats'] = sel.xpath('td[@align = "center"][2]/text()').extract()
            items.append(item)
        return items
I used the parse command for debugging. With this command:
scrapy parse --spider=nfl2 "http://www.pro-football-reference.com/boxscores/201509100nwe.htm"
I get the following output:
>>> STATUS DEPTH LEVEL 1 <<<
# Scraped Items ------------------------------------------------------------
[]
# Requests -----------------------------------------------------------------
[<GET http://www.pro-football-reference.com/years/2015/games.htm>,
<GET http://www.nfl.com/scores/2015/REG1>,
<GET http://www.pro-football-reference.com/boxscores/201509130buf.htm>,
<GET http://www.pro-football-reference.com/boxscores/201509130chi.htm>,
<GET http://www.pro-football-reference.com/boxscores/201509130crd.htm>,
<GET http://www.pro-football-reference.com/boxscores/201509130dal.htm>,
<GET http://www.pro-football-reference.com/boxscores/201509130den.htm>,
<GET http://www.pro-football-reference.com/boxscores/201509130htx.htm>,
<GET http://www.pro-football-reference.com/boxscores/201509130jax.htm>,
<GET http://www.pro-football-reference.com/boxscores/201509130nyj.htm>,
<GET http://www.pro-football-reference.com/boxscores/201509130rai.htm>,
<GET http://www.pro-football-reference.com/boxscores/201509130ram.htm>,
<GET http://www.pro-football-reference.com/boxscores/201509130sdg.htm>,
<GET http://www.pro-football-reference.com/boxscores/201509130tam.htm>,
<GET http://www.pro-football-reference.com/boxscores/201509130was.htm>,
<GET http://www.pro-football-reference.com/boxscores/201509140atl.htm>,
<GET http://www.pro-football-reference.com/boxscores/201509140sfo.htm>]
Why is it that it is logging the requests for the links I want, but it is never entering the parse_game_content function to actually scrape the data? I have also tested the parse_game_content function as the parse function to make sure it is scraping the right data and it works properly in that case.
Thank you for your help!
By default, the parse command fetches the given URL and parses it with the spider that handles it, using the method passed with the --callback option, or parse if none is given. In your case, it runs only the parse function, which is why the requests are listed but parse_game_content is never called. Pass --callback to the command, like this:
scrapy parse --spider=nfl2 "http://www.pro-football-reference.com/boxscores/201509100nwe.htm" --callback=parse_game_content
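As a rough mental model of the behavior described above, here is a toy sketch in plain Python. It is not Scrapy's actual implementation: ToySpider, run_callback, and the dictionary-based response are hypothetical names made up for illustration. The point it tries to show is that one callback is chosen by name and only that method is run on the fetched page.

```python
class ToySpider(object):
    """Hypothetical stand-in for a spider; not a real scrapy.Spider."""

    def parse(self, response):
        # Like the question's parse(): it only produces follow-up requests.
        for link in response["links"]:
            yield {"request": link, "callback": "parse_game_content"}

    def parse_game_content(self, response):
        # Like the question's second callback: it produces items.
        yield {"item": response["url"]}


def run_callback(spider, response, callback=None):
    """Run a single callback chosen by name, defaulting to 'parse'."""
    method = getattr(spider, callback or "parse")
    return list(method(response))


# Made-up response data, standing in for a fetched page.
response = {"url": "boxscore-1", "links": ["boxscore-2", "boxscore-3"]}

# No callback name given: only parse() runs, so we get requests and no items.
default_result = run_callback(ToySpider(), response)

# Explicit callback name: the item-producing method runs instead.
explicit_result = run_callback(ToySpider(), response, callback="parse_game_content")
```

In this model, default_result contains only request entries, which matches the empty "Scraped Items" section in the output above.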
Also, it is better to change your parse_game_content function so that it yields each item instead of collecting them in a list, as follows:
def parse_game_content(self, response):
    print "parse_game_content"
    for sel in response.xpath('//table[@id="team_stats"]/tr'):
        item = NflPredictorItem()
        item['away_stats'] = sel.xpath('td[@align = "center"][1]/text()').extract()
        item['home_stats'] = sel.xpath('td[@align = "center"][2]/text()').extract()
        yield item
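To see what changes between returning a list and yielding, here is a minimal sketch in plain Python with no Scrapy involved. The function names and the sample rows are made up for illustration, and plain dictionaries stand in for NflPredictorItem. Both styles hand back the same items; the generator version just produces them one at a time instead of holding the whole list in memory first.

```python
def collect_then_return(rows):
    """Build the full list first, then hand it back in one piece."""
    items = []
    for row in rows:
        items.append({"away_stats": row[0], "home_stats": row[1]})
    return items


def yield_each(rows):
    """Hand back each item as soon as it is built (a generator)."""
    for row in rows:
        yield {"away_stats": row[0], "home_stats": row[1]}


# Made-up sample values, not real scraped data.
rows = [("21", "28"), ("134", "80")]

as_list = collect_then_return(rows)
as_generator = yield_each(rows)

# Both approaches produce the same items; the generator does so lazily.
same_items = as_list == list(as_generator)
```

Either form is an iterable of items, so the choice is mostly about style and memory use on large pages.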