
Python Scrapy only crawls start_urls and then stops. How to go deeper?

Why is it that Scrapy only crawls the start_urls and then stops? Is there a way to have Scrapy crawl through all pages in a directory tree of a website, such as http://www.example.com/directory? Or, is there a way to have Scrapy crawl deeper into all links on the start_urls pages?

    # Scrapy 0.x imports (SgmlLinkExtractor was removed in later releases)
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


    class DmozSpider(CrawlSpider):
        name = "dmoz"
        allowed_domains = ["dmoz.org"]
        start_urls = [
            "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
            "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
        ]

        rules = [
            Rule(SgmlLinkExtractor(allow=('',)), follow=True),
            Rule(SgmlLinkExtractor(allow=('',)), callback='parse_item'),
        ]

        def parse_item(self, response):
            print(response.url)

        def parse(self, response):
            print(response.url)

Here's the code in my main.py file:

    from scrapy import log, signals
    from scrapy.crawler import Crawler
    from scrapy.utils.project import get_project_settings
    from twisted.internet import reactor

    spider = DmozSpider()

    settings = get_project_settings()

    crawler = Crawler(settings)
    crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
    crawler.configure()
    crawler.crawl(spider)
    crawler.start()
    log.start()
    reactor.run()

Remove parse() from the DmozSpider class, and parse_item() will then reach more than just the start_urls.

To elaborate on @stevetronix's answer a bit:

You are not supposed to override the parse() method when using the CrawlSpider. You should instead set a custom callback with a different name in your Rule.
Here is the excerpt from the official documentation:

When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.
