
python scrapy how to know visited links

Let's say I am scraping thousands of pages.

Then, when I scrape a page, I want to know whether this page has been scraped before, so I can decide whether to scrape it again or not.

I want to know whether Scrapy saves the scraped pages by default.

What I have tried

I save the scraped links in a file, then read it to check whether a specific link has been scraped before. However, I think Scrapy should have a built-in feature to do that.

Right?
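For reference, a minimal sketch of the manual approach described above (the spider name, URL, and the file name visited_links.txt are hypothetical; the built-in filter discussed in the answer below makes this largely unnecessary):

    import scrapy

    class ExampleSpider(scrapy.Spider):
        name = "example"
        start_urls = ["https://example.com/"]

        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            # Load URLs scraped on earlier runs, if the file exists.
            try:
                with open("visited_links.txt") as f:
                    self.visited = {line.strip() for line in f}
            except FileNotFoundError:
                self.visited = set()

        def parse(self, response):
            if response.url in self.visited:
                return  # scraped before; skip it
            self.visited.add(response.url)
            with open("visited_links.txt", "a") as f:
                f.write(response.url + "\n")
            # ... extract data and follow links here ...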

Scrapy has that functionality built in and will filter those requests for you; see scrapy Request in the docs:

dont_filter (boolean) – indicates that this request should not be filtered by the scheduler. This is used when you want to perform an identical request multiple times, to ignore the duplicates filter. Use it with care, or you will get into crawling loops. Default to False.

So when creating requests you can decide whether you want to re-crawl the same URL or not.
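A minimal sketch of both cases (spider name and URLs are placeholders): the first request goes through the scheduler's duplicate filter as usual, while the second sets dont_filter=True to bypass it.

    import scrapy

    class MySpider(scrapy.Spider):
        name = "myspider"
        start_urls = ["https://example.com/"]

        def parse(self, response):
            # Default: the scheduler's duplicate filter drops requests
            # for URLs already seen during this crawl.
            yield scrapy.Request("https://example.com/page",
                                 callback=self.parse_page)

            # dont_filter=True bypasses the duplicate filter, so the
            # same URL is fetched again. Use with care to avoid loops.
            yield scrapy.Request("https://example.com/page",
                                 callback=self.parse_page,
                                 dont_filter=True)

        def parse_page(self, response):
            pass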

For more implementation details, see the default RFPDupeFilter in the Scrapy source code.

There is a settings entry called DUPEFILTER_CLASS in case you wish to replace the default filter with some other dedup logic.
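A sketch of that, assuming you subclass the default RFPDupeFilter and dedup on the plain URL (the module path "myproject.dupefilters" is hypothetical, and the exact hook may vary between Scrapy versions):

    # myproject/dupefilters.py
    from scrapy.dupefilters import RFPDupeFilter

    class URLDupeFilter(RFPDupeFilter):
        # Fingerprint requests by their bare URL instead of the
        # default fingerprint (which also hashes method, body, etc.).
        def request_fingerprint(self, request):
            return request.url

    # settings.py
    DUPEFILTER_CLASS = "myproject.dupefilters.URLDupeFilter"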
