Is there a way to clear the to-visit queue in crawler4j during crawling?
I am trying to figure out a way to change the seed at crawling runtime and completely delete the "to visit" database/queue.
In particular, I would like to remove all the current URLs in the queue and add a new seed. Something along the lines of:
public class MyCrawler extends WebCrawler {

    // Number of URLs rejected since the last reset
    private int discarded = 0;

    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        boolean isDiscarded = checkPage(referringPage, url);
        if (isDiscarded) {
            this.discarded++;
            if (discarded >= 100) {
                // Clear all the URLs that need to be visited
                ?_____?
                // Add the new seed
                this.myController.addSeed("http://new_seed.com");
                discarded = 0;
            }
        }
        // shouldVisit must return true for URLs that should be crawled,
        // so a discarded URL yields false
        return !isDiscarded;
    }
    ....
I know I can call controller.shutdown() and start everything again, but that is rather slow.
There is no built-in functionality for achieving this without modifying the original source code (by forking it or by using the Reflection API).
Every WebCrawler obtains new URLs via a Frontier instance, which stores the current (discovered but not yet fetched) URLs for all web crawlers. Sadly, this field has private access in WebCrawler.
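To illustrate the Reflection route mentioned above, here is a minimal sketch of reading a private field declared in a superclass. StandInCrawler and its "frontier" field are hypothetical stand-ins written for this example, not crawler4j classes; with the real library, the declaring class would be WebCrawler and the field name would have to match whatever that version of the source actually uses.

```java
import java.lang.reflect.Field;
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical stand-in for a library class that hides its queue in a private field.
class StandInCrawler {
    private final Deque<String> frontier = new ArrayDeque<>();

    StandInCrawler() {
        frontier.add("http://old.example/");
    }
}

final class PrivateFieldAccess {
    // Reads a private field that is declared in declaringClass from the given target.
    // Reflection failures are wrapped in an unchecked exception.
    static Object readPrivateField(Class<?> declaringClass, Object target, String fieldName) {
        try {
            Field field = declaringClass.getDeclaredField(fieldName);
            field.setAccessible(true); // bypass the private modifier
            return field.get(target);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot read field " + fieldName, e);
        }
    }
}
```

The declaring class is passed explicitly because getDeclaredField only sees fields declared directly in that class, so calling it on a subclass would not find a field inherited from its parent. This kind of access is brittle and can break whenever the library renames the field.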
If you want to remove all current URLs, you need to reset the Frontier object. Unless you implement a custom Frontier (see the source code) that offers this functionality, resetting will not be possible.
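As a rough sketch of what "a Frontier that offers this functionality" could look like, the class below keeps pending URLs in memory and exposes a single reset operation. ResettableFrontier and all of its methods are hypothetical names invented for this example; this is not crawler4j's Frontier, whose to-visit queue lives in a database, so a real implementation would have to clear that storage as well.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical in-memory frontier shared by all crawler threads (not crawler4j's class).
final class ResettableFrontier {
    private final Deque<String> toVisit = new ArrayDeque<>();

    // Adds a newly discovered URL to the end of the queue.
    synchronized void schedule(String url) {
        toVisit.addLast(url);
    }

    // Removes and returns up to max pending URLs for a crawler thread to fetch.
    synchronized List<String> nextBatch(int max) {
        List<String> batch = new ArrayList<>();
        while (batch.size() < max && !toVisit.isEmpty()) {
            batch.add(toVisit.pollFirst());
        }
        return batch;
    }

    // Drops every pending URL and restarts the queue from a single new seed.
    synchronized void resetWith(String seed) {
        toVisit.clear();
        toVisit.addLast(seed);
    }

    // Number of URLs still waiting to be fetched.
    synchronized int size() {
        return toVisit.size();
    }
}
```

All methods are synchronized so that clearing and re-seeding happen as one step with respect to other threads pulling batches. URLs that a thread had already taken out before the reset are not recalled, so a few old pages may still be fetched after it.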