
python/scrapy question: How to avoid endless loops

I am using the web-scraping framework scrapy to data mine some sites. I am trying to use the CrawlSpider, and the pages have a 'back' and 'next' button. The URLs are in the format

www.qwerty.com/###

where ### is a number that increments every time the next button is pressed. How do I format the rules so that an endless loop doesn't occur?

Here is my rule:

rules = (
    Rule(SgmlLinkExtractor(allow='http://not-a-real-site.com/trunk-framework/791'),
         follow=True, callback='parse_item'),
)
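
For context, the sketch below shows how a full CrawlSpider using such a rule might look. The spider name, domain, start URL, and the regex are illustrative assumptions, not part of the original question; the allow argument is a regular expression, so the incrementing page numbers can be matched with a pattern instead of a single URL.

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class PagingSpider(CrawlSpider):
    # Hypothetical spider; name, domain and regex are assumptions.
    name = 'paging'
    allowed_domains = ['not-a-real-site.com']
    start_urls = ['http://not-a-real-site.com/trunk-framework/791']

    rules = (
        # allow takes a regex: follow every numbered page under trunk-framework.
        Rule(SgmlLinkExtractor(allow=r'/trunk-framework/\d+'),
             follow=True, callback='parse_item'),
    )

    def parse_item(self, response):
        # Placeholder callback: extract the data you need from each page here.
        pass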

An endless loop shouldn't happen: Scrapy will filter out duplicate urls.
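
To make the point concrete, here is a minimal sketch (not part of the original answer): the scheduler's duplicate filter drops requests whose URL has already been seen, so re-yielding a previously crawled 'back' link is harmless unless you explicitly opt out with dont_filter.

from scrapy.http import Request

def parse_item(self, response):
    # This request is silently dropped if the URL was crawled already.
    yield Request('http://not-a-real-site.com/trunk-framework/790')

    # Passing dont_filter=True would bypass the duplicate filter and force a
    # re-download, which is usually not what you want here.
    # yield Request('http://not-a-real-site.com/trunk-framework/790', dont_filter=True)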

What makes you think the program will go into an infinite loop, and how have you tested it? Scrapy won't download a url it has already downloaded before. Did you try to go through all the pages? What happens when you click next on the last page?

You can get into an infinite loop if the site generates a new number every time the next link is pressed. Although that case means the site's code is broken, you can put a limit on the maximum number of pages in your code to avoid looping indefinitely.
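
One way to enforce such a cap (a sketch with an assumed spider name and page limit, not code from the original answer) is to count pages in the spider and close the crawl once the limit is reached:

from scrapy.contrib.spiders import CrawlSpider
from scrapy.exceptions import CloseSpider

class CappedSpider(CrawlSpider):
    # Hypothetical spider: the name and the 500-page cap are assumptions.
    name = 'capped'
    max_pages = 500
    # rules = (...)  as in the question

    def __init__(self, *args, **kwargs):
        super(CappedSpider, self).__init__(*args, **kwargs)
        self.pages_seen = 0

    def parse_item(self, response):
        self.pages_seen += 1
        if self.pages_seen >= self.max_pages:
            # Abort the whole crawl instead of following links forever.
            raise CloseSpider('page limit reached')
        # ... extract items from the page here ...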

You can set a limit on the number of links to follow: use the DEPTH_LIMIT setting.
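
For example (assuming the usual project layout, where crawl-wide options live in settings.py):

# settings.py
# Requests nested deeper than this many hops from the start URLs are ignored.
DEPTH_LIMIT = 100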

Alternatively, you can check the current depth in a parse callback function:

def parse(self, response):
    # 'depth' is populated by Scrapy's DepthMiddleware (enabled by default).
    if response.meta['depth'] > 100:
        print 'Loop?'
