Scrapy CrawlSpider is not following Links

I'm trying to use Scrapy to crawl a site that uses "next" buttons to move to new pages. I'm using a CrawlSpider and have defined a LinkExtractor rule to extract the next pages to follow. However, the spider only crawls the start URL and then stops. I've added the spider code and the log below. Does anyone have any idea why the spider is not following the pagination links?

        from scrapy.spiders import CrawlSpider, Rule
        from scrapy.linkextractors import LinkExtractor
        from realcommercial.items import RealcommercialItem
        from scrapy.selector import Selector
        from scrapy.http import Request

        class RealCommercial(CrawlSpider):
            name = "realcommercial"
            allowed_domains = ["realcommercial.com.au"]
            start_urls = [
                "http://www.realcommercial.com.au/for-sale/in-vic/list-1?nearbySuburb=false&autoSuggest=false&activeSort=list-date"
        ]
            rules = [Rule(LinkExtractor(allow=['/for-sale/in-vic/list-\d+?activeSort=list-date']),
                          callback='parse_response',
                          process_links='process_links',
                          follow=True),
                     Rule(LinkExtractor(allow=[]),
                          callback='parse_response',
                          process_links='process_links',
                          follow=True)]


            def parse_response(self, response):        
                sel = Selector(response)
                sites = sel.xpath("//a[@class='details']")
                #items = []
                for site in sites:
                    item = RealcommercialItem()
                    link = site.xpath('@href').extract()
                    #print link, '\n\n'
                    item['link'] = link
                    link = 'http://www.realcommercial.com.au/' + str(link[0])
                    #print 'link!!!!!!=', link
                    new_request = Request(link, callback=self.parse_file_page)
                    new_request.meta['item'] = item
                    yield new_request
                    #items.append(item)
                yield item
                return

            def process_links(self, links):
                print 'inside process links'
                for i, w in enumerate(links):
                    print w.url,'\n\n\n'
                    w.url = "http://www.realcommercial.com.au/" + w.url
                    print w.url,'\n\n\n'
                    links[i] = w

                return links

            def parse_file_page(self, response):
                #item passed from request
                #print 'parse_file_page!!!'
                item = response.meta['item']
                #selector
                sel = Selector(response)
                title = sel.xpath('//*[@id="listing_address"]').extract()
                #print title
                item['title'] = title

                return item

Log

                2015-11-29 15:42:55 [scrapy] INFO: Scrapy 1.0.3 started (bot: realcommercial)
                2015-11-29 15:42:55 [scrapy] INFO: Optional features available: ssl, http11, boto
                2015-11-29 15:42:55 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'realcommercial.spiders', 'FEED_FORMAT': 'csv', 'SPIDER_MODULES': ['realcommercial.spiders'], 'FEED_URI': 'aaa.csv', 'BOT_NAME': 'realcommercial'}
                2015-11-29 15:42:56 [scrapy] INFO: Enabled extensions: CloseSpider, FeedExporter, TelnetConsole, LogStats, CoreStats, SpiderState
                2015-11-29 15:42:57 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
                2015-11-29 15:42:57 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
                2015-11-29 15:42:57 [scrapy] INFO: Enabled item pipelines:
                2015-11-29 15:42:57 [scrapy] INFO: Spider opened
                2015-11-29 15:42:57 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
                2015-11-29 15:42:57 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
                2015-11-29 15:42:59 [scrapy] DEBUG: Crawled (200) <GET http://www.realcommercial.com.au/for-sale/in-vic/list-1?nearbySuburb=false&autoSuggest=false&activeSort=list-date> (referer: None)
                2015-11-29 15:42:59 [scrapy] INFO: Closing spider (finished)
                2015-11-29 15:42:59 [scrapy] INFO: Dumping Scrapy stats:
                {'downloader/request_bytes': 303,
                 'downloader/request_count': 1,
                 'downloader/request_method_count/GET': 1,
                 'downloader/response_bytes': 30599,
                 'downloader/response_count': 1,
                 'downloader/response_status_count/200': 1,
                 'finish_reason': 'finished',
                 'finish_time': datetime.datetime(2015, 11, 29, 10, 12, 59, 418000),
                 'log_count/DEBUG': 2,
                 'log_count/INFO': 7,
                 'response_received_count': 1,
                 'scheduler/dequeued': 1,
                 'scheduler/dequeued/memory': 1,
                 'scheduler/enqueued': 1,
                 'scheduler/enqueued/memory': 1,
                 'start_time': datetime.datetime(2015, 11, 29, 10, 12, 57, 780000)}
                2015-11-29 15:42:59 [scrapy] INFO: Spider closed (finished)

I found the answer myself. There were two issues:

  1. process_links was prepending "http://www.realcommercial.com.au/" to URLs that already contained it. I had assumed the link extractor would return relative URLs, but it returns absolute ones, so the rewritten URLs were invalid.
  2. The regular expression in the link extractor was not correct, so the pagination links were never matched.

I made both changes and the spider worked; a rough sketch of what the corrected spider might look like is below.
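For reference, here is a minimal sketch of the spider after those two fixes. The original post does not show the corrected code, so the exact regular expression (escaping the "?" and allowing the other query parameters), the class and spider names, and the simplified callback are my assumptions:

        from scrapy.spiders import CrawlSpider, Rule
        from scrapy.linkextractors import LinkExtractor

        class RealCommercialFixed(CrawlSpider):
            # Hypothetical corrected spider; the names and the regex are assumptions.
            name = "realcommercial_fixed"
            allowed_domains = ["realcommercial.com.au"]
            start_urls = [
                "http://www.realcommercial.com.au/for-sale/in-vic/list-1"
                "?nearbySuburb=false&autoSuggest=false&activeSort=list-date"
            ]

            # Fix 2: escape the '?' so it is a literal character rather than a
            # regex quantifier, and allow the other query parameters that come
            # before activeSort=list-date in the pagination URLs.
            rules = [
                Rule(LinkExtractor(allow=[r'/for-sale/in-vic/list-\d+\?.*activeSort=list-date']),
                     callback='parse_response',
                     follow=True),
            ]
            # Fix 1: no process_links that prepends the domain -- the extracted
            # links are already absolute URLs.

            def parse_response(self, response):
                # Yield the listing links found on each paginated results page.
                for href in response.xpath("//a[@class='details']/@href").extract():
                    yield {'link': response.urljoin(href)}

With a pattern like this the link extractor matches the list-2, list-3, ... pagination URLs, and the domain is no longer prepended a second time.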
