
Scrapy Spider not Following Links

I'm writing a scrapy spider to crawl today's NYT articles from the homepage, but for some reason it doesn't follow any links. When I instantiate the link extractor in scrapy shell http://www.nytimes.com, it successfully extracts a list of article urls with le.extract_links(response), but I can't get my crawl command (scrapy crawl nyt -o out.json) to scrape anything beyond the homepage. I'm at my wit's end. Is it because the homepage does not yield an article from the parse function? Any help is greatly appreciated.
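
A rough reconstruction of the shell check described above (the allow pattern is copied from the spider code below; that le was built this way in the shell is my assumption):

# scrapy shell http://www.nytimes.com
from datetime import date
from scrapy.contrib.linkextractors import LinkExtractor

today = date.today().strftime('%Y/%m/%d')
le = LinkExtractor(allow=(r'/%s/[a-z]+/.*\.html' % today, ))
le.extract_links(response)   # returns a list of Link objects for today's article urls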

from datetime import date                                                       

import scrapy                                                                   
from scrapy.contrib.spiders import Rule                                         
from scrapy.contrib.linkextractors import LinkExtractor                         


from ..items import NewsArticle                                                 

with open('urls/debug/nyt.txt') as debug_urls:                                  
    debug_urls = debug_urls.readlines()                                         

with open('urls/release/nyt.txt') as release_urls:                              
    release_urls = release_urls.readlines() # ["http://www.nytimes.com"]                                 

today = date.today().strftime('%Y/%m/%d')                                       
print today                                                                     


class NytSpider(scrapy.Spider):                                                 
    name = "nyt"                                                                
    allowed_domains = ["nytimes.com"]                                           
    start_urls = release_urls                                                      
    rules = (                                                                      
            Rule(LinkExtractor(allow=(r'/%s/[a-z]+/.*\.html' % today, )),          
                 callback='parse', follow=True),                                   
    )                                                                              

    def parse(self, response):                                                     
        article = NewsArticle()                                                                         
        for story in response.xpath('//article[@id="story"]'):                     
            article['url'] = response.url                                          
            article['title'] = story.xpath(                                        
                    '//h1[@id="story-heading"]/text()').extract()                  
            article['author'] = story.xpath(                                       
                    '//span[@class="byline-author"]/@data-byline-name'             
            ).extract()                                                         
            article['published'] = story.xpath(                                 
                    '//time[@class="dateline"]/@datetime').extract()            
            article['content'] = story.xpath(                                   
                    '//div[@id="story-body"]/p//text()').extract()              
            yield article  

I have found the solution to my problem. I was doing two things wrong:

  1. I needed to subclass CrawlSpider rather than Spider if I wanted it to automatically crawl sublinks.
  2. When using CrawlSpider, I needed to use a custom callback function rather than overriding parse. As per the docs, overriding parse breaks CrawlSpider functionality. A corrected sketch follows this list.
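
Putting both fixes together, a minimal sketch of the corrected spider (the callback name parse_article is my own choice, start_urls is hard-coded here instead of being read from the urls/ files, and the item fields are unchanged from the original):

from datetime import date

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

from ..items import NewsArticle

today = date.today().strftime('%Y/%m/%d')


class NytSpider(CrawlSpider):
    # Subclassing CrawlSpider (not scrapy.Spider) enables the rules below.
    name = "nyt"
    allowed_domains = ["nytimes.com"]
    start_urls = ["http://www.nytimes.com"]
    rules = (
        # Follow today's article links and hand each response to parse_article.
        # The callback must NOT be named 'parse', or CrawlSpider's built-in
        # link following stops working.
        Rule(LinkExtractor(allow=(r'/%s/[a-z]+/.*\.html' % today, )),
             callback='parse_article', follow=True),
    )

    def parse_article(self, response):
        article = NewsArticle()
        article['url'] = response.url
        article['title'] = response.xpath(
            '//h1[@id="story-heading"]/text()').extract()
        article['author'] = response.xpath(
            '//span[@class="byline-author"]/@data-byline-name').extract()
        article['published'] = response.xpath(
            '//time[@class="dateline"]/@datetime').extract()
        article['content'] = response.xpath(
            '//div[@id="story-body"]/p//text()').extract()
        yield article

With this version, scrapy crawl nyt -o out.json should yield one item per article page instead of stopping at the homepage.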
