I've just started toying around with Scrapy to help scrape some fantasy basketball stats. My main problem is in my spider - how do I scrape the href attribute of a link and then call back another parser on that URL?
I looked into link extractors, and I think they might be my solution, but I'm not sure. I've re-read the documentation over and over again and am still confused about where to start. The following is the code I have so far:
    def parse_player(self, response):
        player_name = "Steven Adams"
        sel = Selector(response)
        # extract() returns a list of matches; take the first href
        player_url = sel.xpath("//a[text()='%s']/@href" % player_name).extract()[0]
        # note: no quote characters around the substituted path
        return Request("http://sports.yahoo.com%s" % player_url, callback=self.parse_curr_stats)
    def parse_curr_stats(self, response):
        sel = Selector(response)
        stats = sel.xpath("//div[@id='mediasportsplayercareerstats']//table[@summary='Player']/tbody/tr[last()-1]")
        items = []
        for stat in stats:
            item = player_item()
            item['fgper'] = stat.xpath("td[@title='Field Goal Percentage']/text()").extract()
            item['ftper'] = stat.xpath("td[@title='Free Throw Percentage']/text()").extract()
            item['treys'] = stat.xpath("td[@title='3-point Shots Made']/text()").extract()
            item['pts'] = stat.xpath("td[@title='Points']/text()").extract()
            item['reb'] = stat.xpath("td[@title='Total Rebounds']/text()").extract()
            item['ast'] = stat.xpath("td[@title='Assists']/text()").extract()
            item['stl'] = stat.xpath("td[@title='Steals']/text()").extract()
            item['blk'] = stat.xpath("td[@title='Blocked Shots']/text()").extract()
            item['tov'] = stat.xpath("td[@title='Turnovers']/text()").extract()
            item['fga'] = stat.xpath("td[@title='Field Goals Attempted']/text()").extract()
            item['fgm'] = stat.xpath("td[@title='Field Goals Made']/text()").extract()
            item['fta'] = stat.xpath("td[@title='Free Throws Attempted']/text()").extract()
            item['ftm'] = stat.xpath("td[@title='Free Throws Made']/text()").extract()
            items.append(item)
        return items
So as you can see, in the first parse function you're given a name, and you look for the link on the page that will guide you to the player's individual page, which is stored in "player_url". How do I then go to that page and run the second parser on it?
I feel as if I'm completely glossing over something, and if someone could shed some light it would be greatly appreciated!
When you want to send a Request object, just use yield rather than return, like this:
    def parse_player(self, response):
        ......
        yield Request(......)
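As a runnable sketch of that fix applied to the question's spider (Request here is a minimal stand-in for scrapy.Request, and the player path is hard-coded, so the generator can be exercised without a live crawl):

```python
class Request:
    # minimal stand-in for scrapy.Request, just enough to show the pattern
    def __init__(self, url, callback=None):
        self.url = url
        self.callback = callback

def parse_curr_stats(response):
    pass  # placeholder for the second parser

def parse_player(response):
    # in the real spider this path would come from the XPath href lookup
    player_url = "/nba/players/5861/"
    # yield hands the Request back to the engine instead of ending the method
    yield Request("http://sports.yahoo.com%s" % player_url,
                  callback=parse_curr_stats)

requests = list(parse_player(None))
```

When the engine follows that Request and the page downloads, it invokes the attached callback with the new response, which is exactly how control reaches the second parser.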
If there are many Request objects that you want to send from a single parse method, a best practice is to do it like this:
    def parse_player(self, response):
        ......
        res_objs = []
        # then add every Request object to the 'res_objs' list,
        # and at the end of the method, do the following:
        for req in res_objs:
            yield req
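A runnable illustration of that pattern, with plain URL strings standing in for the Request objects so it can be tested without Scrapy:

```python
def parse_player(response):
    res_objs = []
    # collect everything you want to send...
    for url in ("http://example.com/a", "http://example.com/b"):
        res_objs.append(url)  # would be Request(url, callback=...) in Scrapy
    # ...and yield the whole batch at the end of the method
    for req in res_objs:
        yield req

sent = list(parse_player(None))
```

Each yield hands one object back to the caller, so the engine receives the requests one by one in the order they were appended.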
I think that when the Scrapy spider is running, it handles requests under the hood roughly like this:

    # handle requests
    for req_obj in self.parse_player(response):
        # do something with each *Request* object
So just remember to use yield when sending Request objects.
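The difference is visible in plain Python: a callback written with yield returns a generator the engine can iterate, while one written with return hands back a single object (strings stand in for Request objects in this sketch):

```python
import types

def parse_with_return(response):
    return "req"   # the caller gets one object back

def parse_with_yield(response):
    yield "req"    # calling the method now produces a generator

# the engine iterates whatever the callback returns,
# so the generator form works naturally with the loop above
assert not isinstance(parse_with_return(None), types.GeneratorType)
assert isinstance(parse_with_yield(None), types.GeneratorType)
```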