I've just started toying around with Scrapy to help scrape some fantasy basketball stats. My main problem is in my spider - how do I scrape the href attribute of a link and then call back another parser on that URL?
I looked into link extractors, and I think they might be my solution, but I'm not sure. I've re-read the documentation over and over again and am still confused about where to start. The following is the code I have so far:
    def parse_player(self, response):
        player_name = "Steven Adams"
        sel = Selector(response)
        # extract() returns a list of matches; take the first href
        player_url = sel.xpath("//a[text()='%s']/@href" % player_name).extract()[0]
        # note: no quote characters around the substituted path
        return Request("http://sports.yahoo.com%s" % player_url, callback=self.parse_curr_stats)
    def parse_curr_stats(self, response):
        sel = Selector(response)
        stats = sel.xpath("//div[@id='mediasportsplayercareerstats']//table[@summary='Player']/tbody/tr[last()-1]")
        items = []
        for stat in stats:
            item = player_item()
            item['fgper'] = stat.xpath("td[@title='Field Goal Percentage']/text()").extract()
            item['ftper'] = stat.xpath("td[@title='Free Throw Percentage']/text()").extract()
            item['treys'] = stat.xpath("td[@title='3-point Shots Made']/text()").extract()
            item['pts'] = stat.xpath("td[@title='Points']/text()").extract()
            item['reb'] = stat.xpath("td[@title='Total Rebounds']/text()").extract()
            item['ast'] = stat.xpath("td[@title='Assists']/text()").extract()
            item['stl'] = stat.xpath("td[@title='Steals']/text()").extract()
            item['blk'] = stat.xpath("td[@title='Blocked Shots']/text()").extract()
            item['tov'] = stat.xpath("td[@title='Turnovers']/text()").extract()
            item['fga'] = stat.xpath("td[@title='Field Goals Attempted']/text()").extract()
            item['fgm'] = stat.xpath("td[@title='Field Goals Made']/text()").extract()
            item['fta'] = stat.xpath("td[@title='Free Throws Attempted']/text()").extract()
            item['ftm'] = stat.xpath("td[@title='Free Throws Made']/text()").extract()
            items.append(item)
        return items
So as you can see, in the first parse function you're given a name, and you look for the link on the page that will guide you to the player's individual page, which is stored in "player_url". How do I then go to that page and run the second parser on it?
I feel as if I'm completely glossing over something, and if someone could shed some light it would be greatly appreciated!
When you want to send a Request object, just use yield rather than return, like this:
    def parse_player(self, response):
        ......
        yield Request(......)
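As a runnable sketch of that fix applied to the question's spider (Request here is a minimal stand-in for scrapy.Request, and the player path is hard-coded, so the generator can be exercised without a live crawl):

```python
class Request:
    # minimal stand-in for scrapy.Request, just enough to show the pattern
    def __init__(self, url, callback=None):
        self.url = url
        self.callback = callback

def parse_curr_stats(response):
    pass  # placeholder for the second parser

def parse_player(response):
    # in the real spider this path would come from the XPath href lookup
    player_url = "/nba/players/5861/"
    # yield hands the Request back to the engine instead of ending the method
    yield Request("http://sports.yahoo.com%s" % player_url,
                  callback=parse_curr_stats)

requests = list(parse_player(None))
```

When the engine follows that Request and the page downloads, it invokes the attached callback with the new response, which is exactly how control reaches the second parser.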
If there are many Request objects that you want to send from a single parse method, a best practice is to do it like this:
    def parse_player(self, response):
        ......
        res_objs = []
        # then add every Request object to the 'res_objs' list,
        # and at the end of the method, do the following:
        for req in res_objs:
            yield req
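A runnable illustration of that pattern, with plain URL strings standing in for the Request objects so it can be tested without Scrapy:

```python
def parse_player(response):
    res_objs = []
    # collect everything you want to send...
    for url in ("http://example.com/a", "http://example.com/b"):
        res_objs.append(url)  # would be Request(url, callback=...) in Scrapy
    # ...and yield the whole batch at the end of the method
    for req in res_objs:
        yield req

sent = list(parse_player(None))
```

Each yield hands one object back to the caller, so the engine receives the requests one by one in the order they were appended.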
I think that when the Scrapy spider is running, it handles requests under the hood roughly like this:

    # handle requests
    for req_obj in self.parse_player(response):
        # do something with each *Request* object
So just remember to use yield when sending Request objects.
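The difference is visible in plain Python: a callback written with yield returns a generator the engine can iterate, while one written with return hands back a single object (strings stand in for Request objects in this sketch):

```python
import types

def parse_with_return(response):
    return "req"   # the caller gets one object back

def parse_with_yield(response):
    yield "req"    # calling the method now produces a generator

# the engine iterates whatever the callback returns,
# so the generator form works naturally with the loop above
assert not isinstance(parse_with_return(None), types.GeneratorType)
assert isinstance(parse_with_yield(None), types.GeneratorType)
```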