
scrapy python CrawlSpider not crawling

import scrapy 
from scrapy.spiders.crawl import CrawlSpider, Rule 
from scrapy.linkextractors import LinkExtractor 

class MySpider(CrawlSpider):
    name = 'genericSpider'
    allowed_domains = ['example.com']
    start_urls = [url_1, url_2, url_3]

    rules = [
        Rule(
            LinkExtractor(),                     
            callback='parse',   
            follow=True        
        ),
    ]

    def parse(self, response): 
        hxs = scrapy.Selector(response)
        links = hxs.xpath('*//a/@href').extract()
        for link in links:
            print(link)
        print()

I'm attempting to crawl a website. As an example of my code, I'm just extracting all links and printing them to the terminal.

This works fine for the URLs in start_urls, but the spider doesn't seem to crawl the extracted URLs.

Isn't that the point of CrawlSpider: visit a page, collect its links, and keep visiting those links until it runs out of them?

I've been stuck on this for a few days; any help would be great.

The problem is that you named your method parse. As per the documentation, this name should be avoided when using CrawlSpider, because CrawlSpider uses the parse method itself to implement its rule-following logic, so overriding it breaks the crawl. Just rename the method to e.g. parse_link (and adjust the callback argument in the Rule) and it will work.

Also, remember that the allowed_domains attribute must match the domains of the URLs you intend to crawl.
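
For reference, here is a minimal sketch of the corrected spider. The allowed_domains value is the placeholder from the question and the start URL is made up for illustration; substitute your own:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class MySpider(CrawlSpider):
    name = 'genericSpider'
    # Must cover every domain you expect the spider to follow links into
    allowed_domains = ['example.com']
    start_urls = ['https://www.example.com/']  # placeholder start URL

    rules = [
        Rule(
            LinkExtractor(),
            callback='parse_link',  # renamed so CrawlSpider's own parse() stays intact
            follow=True,
        ),
    ]

    def parse_link(self, response):
        # print every link found on the crawled page
        for link in response.xpath('//a/@href').getall():
            print(link)

You can run it with scrapy runspider myspider.py (or scrapy crawl genericSpider inside a project) to confirm that the extracted links are now crawled as well.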
