
Using python scrapy to extract links from a webpage

I am a beginner with Python and am using Scrapy to extract links from the following webpage: http://www.basketball-reference.com/leagues/NBA_2015_games.html.

The code that I have written is:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from basketball.items import BasketballItem

class BasketballSpider(CrawlSpider):
    name = 'basketball'
    allowed_domains = ['basketball-reference.com/']
    start_urls = ['http://www.basketball-reference.com/leagues/NBA_2015_games.html']
    rules = [Rule(LinkExtractor(allow=['http://www.basketball-reference.com/boxscores/^\w+$']), 'parse_item')]

    def parse_item(self, response):
        item = BasketballItem()
        item['url'] = response.url
        return item

I run this code through the command prompt, but the file created does not have any links. Could someone please help?
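(For reference, the exact command isn't shown in the question; it was presumably something along the lines of the line below, which runs the spider and exports the scraped items to a file. The output filename links.csv is just an assumption.)

scrapy crawl basketball -o links.csv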

It cannot find the links; fix your regular expression in the rule:

rules = [
    Rule(LinkExtractor(allow='boxscores/\w+'))
]
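To see why the original pattern finds nothing, you can test both patterns directly with the re module (LinkExtractor checks its allow patterns with a regex search against each extracted URL). The boxscore URL below is just a representative example, not taken from the question:

import re

url = 'http://www.basketball-reference.com/boxscores/201410280LAL.html'

# Original pattern: '^' and '$' anchor to the start/end of the whole string,
# so they can never match in the middle of a URL - the search always fails.
print(re.search(r'http://www.basketball-reference.com/boxscores/^\w+$', url))  # None

# Relaxed pattern: matches the 'boxscores/<id>' part of any boxscore link.
print(re.search(r'boxscores/\w+', url))  # <re.Match ...>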

Also note that if you leave the callback off, the Rule will only follow the extracted links; parse_item is not called by default, so keep callback='parse_item' set.

And allow can be given as a single string instead of a list:

rules = [
    Rule(LinkExtractor(allow='boxscores/\w+'), callback='parse_item')
]
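Putting it together, a minimal sketch of the corrected spider might look like the following. It assumes the newer scrapy.spiders / scrapy.linkextractors import paths (which replaced scrapy.contrib in Scrapy 1.0) and defines a stand-in BasketballItem with the single url field used in the question, instead of importing it from basketball.items:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class BasketballItem(scrapy.Item):
    # Stand-in for basketball.items.BasketballItem; only the field
    # the question actually uses.
    url = scrapy.Field()


class BasketballSpider(CrawlSpider):
    name = 'basketball'
    allowed_domains = ['basketball-reference.com']  # domain only, no trailing slash
    start_urls = ['http://www.basketball-reference.com/leagues/NBA_2015_games.html']

    rules = [
        Rule(LinkExtractor(allow=r'boxscores/\w+'), callback='parse_item'),
    ]

    def parse_item(self, response):
        # One item per followed boxscore page, carrying just its URL.
        item = BasketballItem()
        item['url'] = response.url
        return item

Running scrapy crawl basketball -o links.csv against this sketch should then write one row per boxscore URL.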
