I am a beginner with Python and am using Scrapy to extract links from the following webpage: http://www.basketball-reference.com/leagues/NBA_2015_games.html
The code that I have written is:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from basketball.items import BasketballItem

class BasketballSpider(CrawlSpider):
    name = 'basketball'
    allowed_domains = ['basketball-reference.com/']
    start_urls = ['http://www.basketball-reference.com/leagues/NBA_2015_games.html']
    rules = [Rule(LinkExtractor(allow=['http://www.basketball-reference.com/boxscores/^\w+$']), 'parse_item')]

    def parse_item(self, response):
        item = BasketballItem()
        item['url'] = response.url
        return item
I run this code through the command prompt, but the file created does not have any links. Could someone please help?
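The spider cannot find the links because the regular expression in the rule never matches: ^ and $ anchor the start and end of the whole URL, so placing them after boxscores/ rules out every link. Fix your regular expression in the rule:
rules = [
    Rule(LinkExtractor(allow=r'boxscores/\w+'))
]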
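To see why the original pattern fails, here is a minimal sketch that uses only the standard `re` module. It assumes the link extractor tests each URL with a regex search, and the sample URL is just an illustrative guess at what a box score link looks like, not one taken from the question.

```python
import re

# Pattern from the question: '^' sits in the middle of the pattern, but
# without re.MULTILINE it can only match at the very start of the string.
original = re.compile(r'http://www.basketball-reference.com/boxscores/^\w+$')

# Pattern from the answer: no anchors, so a search can match anywhere in the URL.
fixed = re.compile(r'boxscores/\w+')

# Assumed example URL, for illustration only.
sample_url = 'http://www.basketball-reference.com/boxscores/201410280LAL.html'

original_matches = original.search(sample_url) is not None
fixed_matches = fixed.search(sample_url) is not None

print('original pattern matches:', original_matches)
print('fixed pattern matches:', fixed_matches)
```

The fixed pattern only needs to match part of the URL, which is why the trailing `.html` does not have to be spelled out.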
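Also, note that a Rule has no default callback. If you leave it out, as in the snippet above, matching links are followed but parse_item is never called, so no items are produced. Pass callback='parse_item' explicitly.
And allow can be given as a single string instead of a list: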
rules = [
    Rule(LinkExtractor(allow=r'boxscores/\w+'), callback='parse_item')
]
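As a rough illustration of why a single string and a one-element list behave the same for allow, here is a stand-alone sketch. The helpers `compile_allow` and `filter_links` are hypothetical names invented for this example and are not Scrapy's implementation, and the URLs are assumed examples.

```python
import re

def compile_allow(allow):
    """Hypothetical helper: accept one pattern string or an iterable of patterns."""
    if isinstance(allow, str):
        allow = [allow]
    return [re.compile(pattern) for pattern in allow]

def filter_links(urls, allow):
    """Keep the URLs that match at least one allowed pattern (regex search)."""
    patterns = compile_allow(allow)
    return [url for url in urls if any(p.search(url) for p in patterns)]

# Assumed example URLs, for illustration only.
urls = [
    'http://www.basketball-reference.com/boxscores/201410280LAL.html',
    'http://www.basketball-reference.com/teams/LAL/2015.html',
]

from_string = filter_links(urls, r'boxscores/\w+')
from_list = filter_links(urls, [r'boxscores/\w+'])

print(from_string)
print(from_list)
```

Both calls keep only the box score URL, so which form you use is a matter of taste when there is a single pattern.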