
Using python scrapy to extract links from a webpage

I am a beginner with Python and am using Scrapy to extract links from the following webpage: http://www.basketball-reference.com/leagues/NBA_2015_games.html.

The code that I have written is:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from basketball.items import BasketballItem

class BasketballSpider(CrawlSpider):
    name = 'basketball'
    allowed_domains = ['basketball-reference.com/']
    start_urls = ['http://www.basketball-reference.com/leagues/NBA_2015_games.html']
    rules = [Rule(LinkExtractor(allow=['http://www.basketball-reference.com/boxscores/^\w+$']), 'parse_item')]

    def parse_item(self, response):
        item = BasketballItem()
        item['url'] = response.url
        return item

I run this code through the command prompt, but the file created does not have any links. Could someone please help?
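(For reference, the exact command isn't shown in the question; it was presumably something along the lines of the line below, which runs the spider and exports the scraped items to a file. The output filename links.csv is just an assumption.)

scrapy crawl basketball -o links.csv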

It cannot find the links; fix your regular expression in the rule:

rules = [
    Rule(LinkExtractor(allow='boxscores/\w+'))
]
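To see why the original pattern finds nothing, you can test both patterns directly with the re module (LinkExtractor checks its allow patterns with a regex search against each extracted URL). The boxscore URL below is just a representative example, not taken from the question:

import re

url = 'http://www.basketball-reference.com/boxscores/201410280LAL.html'

# Original pattern: '^' and '$' anchor to the start/end of the whole string,
# so they can never match in the middle of a URL - the search always fails.
print(re.search(r'http://www.basketball-reference.com/boxscores/^\w+$', url))  # None

# Relaxed pattern: matches the 'boxscores/<id>' part of any boxscore link.
print(re.search(r'boxscores/\w+', url))  # <re.Match ...>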

Also note that if you leave the callback off, the Rule will only follow the extracted links; parse_item is not called by default, so keep callback='parse_item' set.

And allow can be given as a single string instead of a list:

rules = [
    Rule(LinkExtractor(allow='boxscores/\w+'), callback='parse_item')
]
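Putting it together, a minimal sketch of the corrected spider might look like the following. It assumes the newer scrapy.spiders / scrapy.linkextractors import paths (which replaced scrapy.contrib in Scrapy 1.0) and defines a stand-in BasketballItem with the single url field used in the question, instead of importing it from basketball.items:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class BasketballItem(scrapy.Item):
    # Stand-in for basketball.items.BasketballItem; only the field
    # the question actually uses.
    url = scrapy.Field()


class BasketballSpider(CrawlSpider):
    name = 'basketball'
    allowed_domains = ['basketball-reference.com']  # domain only, no trailing slash
    start_urls = ['http://www.basketball-reference.com/leagues/NBA_2015_games.html']

    rules = [
        Rule(LinkExtractor(allow=r'boxscores/\w+'), callback='parse_item'),
    ]

    def parse_item(self, response):
        # One item per followed boxscore page, carrying just its URL.
        item = BasketballItem()
        item['url'] = response.url
        return item

Running scrapy crawl basketball -o links.csv against this sketch should then write one row per boxscore URL.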
