Scrapy spider not following links

Question

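I'm writing a Scrapy spider to scrape today's NYT articles from the homepage, but for some reason it doesn't follow any links. When I instantiate the link extractor in scrapy shell http://www.nytimes.com, it successfully extracts a list of article URLs with le.extract_links(response), but I can't get my crawl command (scrapy crawl nyt -o out.json) to scrape anything other than the homepage. I'm at my wit's end. Is it because the homepage doesn't yield an article from the parse function? Any help is greatly appreciated.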

from datetime import date                                                       

import scrapy                                                                   
from scrapy.contrib.spiders import Rule                                         
from scrapy.contrib.linkextractors import LinkExtractor                         


from ..items import NewsArticle                                                 

with open('urls/debug/nyt.txt') as debug_urls:                                  
    debug_urls = debug_urls.readlines()                                         

with open('urls/release/nyt.txt') as release_urls:                              
    release_urls = release_urls.readlines() # ["http://www.nytimes.com"]                                 

today = date.today().strftime('%Y/%m/%d')                                       
print today                                                                     


class NytSpider(scrapy.Spider):                                                 
    name = "nyt"                                                                
    allowed_domains = ["nytimes.com"]                                           
    start_urls = release_urls                                                      
    rules = (                                                                      
            Rule(LinkExtractor(allow=(r'/%s/[a-z]+/.*\.html' % today, )),          
                 callback='parse', follow=True),                                   
    )                                                                              

    def parse(self, response):                                                     
        article = NewsArticle()                                                                         
        for story in response.xpath('//article[@id="story"]'):                     
            article['url'] = response.url                                          
            article['title'] = story.xpath(                                        
                    '//h1[@id="story-heading"]/text()').extract()                  
            article['author'] = story.xpath(                                       
                    '//span[@class="byline-author"]/@data-byline-name'             
            ).extract()                                                         
            article['published'] = story.xpath(                                 
                    '//time[@class="dateline"]/@datetime').extract()            
            article['content'] = story.xpath(                                   
                    '//div[@id="story-body"]/p//text()').extract()              
            yield article

Answer 1

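I found the solution to my problem. I was doing two things wrong:

I needed to subclass CrawlSpider rather than Spider if I wanted it to automatically crawl sublinks.
When using CrawlSpider, I needed to use a separate callback function rather than overriding parse. According to the docs, overriding parse breaks CrawlSpider functionality.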
