
Is it possible to pause and resume crawling using the Java crawler crawler4j?

I already know that you can configure crawling to be resumable.

But is it possible to use the resumable functionality to pause the crawling process and then resume it later programmatically? For example, I could gracefully shut down crawling with the crawler's shutdown method and with the resumable parameter set to true, then start crawling again.

Will it work this way, given that the primary purpose of the resumable parameter is to handle accidental crashes of the crawler? Is there another or better way to achieve this functionality with crawler4j?
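For illustration, the "pause" step described above could look roughly like the following (a minimal sketch; MyCrawler stands for a user-defined WebCrawler subclass, and the storage folder path is just an example):

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class PauseCrawlExample {
    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawler4j-data"); // example path; must be reused on resume
        config.setResumableCrawling(true);                    // persist crawl state to the storage folder

        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtServer robotstxtServer = new RobotstxtServer(new RobotstxtConfig(), pageFetcher);
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

        controller.addSeed("https://www.example.com/");
        controller.startNonBlocking(MyCrawler.class, 4);      // start crawling in the background

        Thread.sleep(60_000);                                 // ...crawl for a while...
        controller.shutdown();                                // request a graceful shutdown ("pause")
        controller.waitUntilFinish();                         // wait for crawler threads to stop
    }
}
```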

If you set the parameter resumable to true, the Frontier as well as the DocIdServer will store their queues in the user-defined storage folder.

This works both for a crash and for a programmatic shutdown. In both cases, the storage folder must be the same.
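As a rough sketch of how a later run could pick up where the previous one stopped (assuming the same hypothetical MyCrawler class and the same example storage folder as above):

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class ResumeCrawlExample {
    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawler4j-data"); // same folder the previous run used
        config.setResumableCrawling(true);                    // reload the persisted Frontier/DocIdServer queues

        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtServer robotstxtServer = new RobotstxtServer(new RobotstxtConfig(), pageFetcher);
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

        // URLs still pending in the persisted queue are picked up again;
        // this blocking start continues where the previous run stopped.
        controller.start(MyCrawler.class, 4);
    }
}
```

The key point is that both runs point setCrawlStorageFolder at the same directory and keep setResumableCrawling(true); whether the previous run ended in a crash or in a call to shutdown() makes no difference to the persisted queues.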

See also the related issue on the official issue tracker.

