
Scrapy not crawling all the pages

I am trying to crawl sites in a very basic manner, but Scrapy isn't crawling all the links. I will explain the scenario as follows:

main_page.html -> contains links to a_page.html, b_page.html, c_page.html
a_page.html -> contains links to a1_page.html, a2_page.html
b_page.html -> contains links to b1_page.html, b2_page.html
c_page.html -> contains links to c1_page.html, c2_page.html
a1_page.html -> contains link to b_page.html
a2_page.html -> contains link to c_page.html
b1_page.html -> contains link to a_page.html
b2_page.html -> contains link to c_page.html
c1_page.html -> contains link to a_page.html
c2_page.html -> contains link to main_page.html

I am using the following rule in CrawlSpider:

Rule(SgmlLinkExtractor(allow=()), callback='parse_item', follow=True)
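
For context, here is a minimal sketch of how that rule might sit in a full spider, using the Scrapy 0.x-era imports that match SgmlLinkExtractor; the spider name, allowed domain, start URL, and parse_item body are assumptions, not taken from the question:

# Sketch only; name, allowed_domains, start_urls and parse_item are assumed.
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class TestSpider(CrawlSpider):
    name = 'test_spider'
    allowed_domains = ['localhost']
    start_urls = ['http://localhost/main_page.html']

    rules = (
        # Extract every link, follow it, and pass each response to parse_item.
        Rule(SgmlLinkExtractor(allow=()), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        self.log('Crawled %s' % response.url)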

But the crawl results are as follows:

DEBUG: Crawled (200) <http://localhost/main_page.html> (referer: None)
2011-12-05 09:56:07+0530 [test_spider] DEBUG: Crawled (200) <http://localhost/a_page.html> (referer: http://localhost/main_page.html)
2011-12-05 09:56:07+0530 [test_spider] DEBUG: Crawled (200) <http://localhost/a1_page.html> (referer: http://localhost/a_page.html)
2011-12-05 09:56:07+0530 [test_spider] DEBUG: Crawled (200) <http://localhost/b_page.html> (referer: http://localhost/a1_page.html)
2011-12-05 09:56:07+0530 [test_spider] DEBUG: Crawled (200) <http://localhost/b1_page.html> (referer: http://localhost/b_page.html)
2011-12-05 09:56:07+0530 [test_spider] INFO: Closing spider (finished)

It is not crawling all the pages.

NB: I have configured the crawl to run in BFO (breadth-first order), as indicated in the Scrapy docs.

What am I missing?

Scrapy will by default filter out all duplicate requests.

You can circumvent this by using (example):

yield Request(url="test.com", callback=self.callback, dont_filter=True)

dont_filter (boolean) – indicates that this request should not be filtered by the scheduler. This is used when you want to perform an identical request multiple times, to ignore the duplicates filter. Use it with care, or you will get into crawling loops. Defaults to False.
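
For illustration, a minimal sketch of yielding such a request from a parse callback; the spider name, start URL, and link selector are assumptions, and it uses the current Scrapy API rather than the 0.14-era one from the question:

# Sketch only; spider name, start URL and selector are assumed.
from scrapy import Spider, Request

class ExampleSpider(Spider):
    name = 'example'
    start_urls = ['http://localhost/main_page.html']

    def parse(self, response):
        for href in response.css('a::attr(href)').getall():
            # dont_filter=True lets already-seen URLs be scheduled again,
            # bypassing the duplicate filter (use with care: this can loop
            # forever on a cyclic site like the one described above).
            yield Request(response.urljoin(href), callback=self.parse,
                          dont_filter=True)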

Also see the Request object documentation.

I had a similar problem today, although I was using a custom spider. It turned out that the website was limiting my crawl because my user agent was scrappy-bot.

Try changing your user agent to that of a known browser, and try again.

Another thing you might want to try is adding a delay. Some websites prevent scraping if the time between requests is too small. Try adding a DOWNLOAD_DELAY of 2 and see if that helps.

More information about DOWNLOAD_DELAY at http://doc.scrapy.org/en/0.14/topics/settings.html
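
A sketch of the corresponding settings.py entries; USER_AGENT and DOWNLOAD_DELAY are real Scrapy settings, but the exact user-agent string and the 2-second value here are assumptions:

# settings.py sketch; the user-agent string and delay value are assumed.
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; rv:8.0) Gecko/20100101 Firefox/8.0'
DOWNLOAD_DELAY = 2  # seconds to wait between consecutive requests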

Maybe a lot of the URLs are duplicates. Scrapy avoids crawling duplicates since that is inefficient. From your explanation, since you use a follow-links rule, there are of course a lot of duplicates.

If you want to be sure and see the proof in the log, add this to your settings.py:

DUPEFILTER_DEBUG = True

And you'll see lines like this in the log:

2016-09-20 17:08:47 [scrapy] DEBUG: Filtered duplicate request: <http://www.example.org/example.html>
