
How to write a rule for scrapy to add visited urls

When scrapy shuts down, it forgets all the urls it has visited. I want to give scrapy a set of urls that have already been crawled when it starts. How can I add a rule to the crawlspider to let it know which urls have been visited?

Current function:

SgmlLinkExtractor(allow=(), deny=(), allow_domains=(), deny_domains=(), restrict_xpaths=(), tags=('a', 'area'), attrs=('href',), canonicalize=True, unique=True, process_value=None)

I just use parse to tell the spider which url to crawl. How can I tell scrapy which urls it should not visit?
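For context, here is a minimal sketch (not from the question) of how that extractor is normally wired into a CrawlSpider rule; the spider name, domain and regex patterns below are placeholders, and the deny= patterns are the built-in way to keep matching urls from being requested at all:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class MySpider(CrawlSpider):
    # placeholder name, domain and start url
    name = 'myspider'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    rules = (
        # allow/deny take regexes; any url matching a deny pattern is never requested
        Rule(SgmlLinkExtractor(allow=(r'/items/',), deny=(r'/login',)),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        self.log('visited %s' % response.url)

This only filters by pattern, though; it does not remember individual urls from a previous run.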

When scrapy stops, it saves the fingerprints of crawled URLs in a request.seen file. This is done by the dedup class, which is used to avoid crawling a url twice; if you restart a scraper with the same job directory, it will not crawl already seen urls. If you want to control this process, you can replace the default dedup class with your own. Another solution is to add your own spidermiddleware.
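One possible way to do the replacement described above is a sketch like the following, assuming the old-style import path (scrapy.dupefilter; newer versions use scrapy.dupefilters) and a made-up visited_urls.txt seed file with one already-crawled url per line:

from scrapy.dupefilter import RFPDupeFilter
from scrapy.http import Request
from scrapy.utils.request import request_fingerprint

class SeededDupeFilter(RFPDupeFilter):
    # the default fingerprint-based dedup filter, pre-loaded with urls
    # crawled in earlier runs
    def __init__(self, path=None):
        super(SeededDupeFilter, self).__init__(path)
        try:
            # hypothetical seed file: one already-visited url per line
            with open('visited_urls.txt') as f:
                for url in f:
                    url = url.strip()
                    if url:
                        self.fingerprints.add(request_fingerprint(Request(url)))
        except IOError:
            pass  # no seed file, start with an empty filter

Then point scrapy at it in settings.py (the module path is a placeholder):

DUPEFILTER_CLASS = 'myproject.dupefilters.SeededDupeFilter'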

Scrapy's Jobs functionality allows you to start and pause your spider. You can persist information about the spider between runs, and it will automatically skip duplicate requests when you restart.
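Concretely, this is driven by the JOBDIR setting. A minimal example (the spider and directory names are placeholders) is to pass it on the command line, scrapy crawl somespider -s JOBDIR=crawls/somespider-1, or set it in settings.py:

# settings.py
# scrapy keeps the pending request queue and the seen-request
# fingerprints in this directory between runs, so a restarted crawl
# skips urls it has already visited (directory name is a placeholder)
JOBDIR = 'crawls/somespider-1'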

See here for more information: Jobs: pausing and resuming crawls
