Is there a way to clear the to-visit queue in crawler4j during crawling?

Question

I am trying to figure out a way to change the seed while the crawl is running and to completely delete the "to visit" database/queue.

In particular, I would like to remove all the URLs currently in the queue and add a new seed. Something along the lines of:

public class MyCrawler extends WebCrawler {

    // Number of URLs rejected since the last seed change
    private int discarded = 0;

    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        boolean isDiscarded = checkPage(referringPage, url);
        if (isDiscarded) {
            this.discarded++;
            if (discarded >= 100) {
                // Clear all the urls that need to be visited
                ?_____?
                // Add the new seed
                this.myController.addSeed("http://new_seed.com");
                discarded = 0;
            }
        }
        // Visit only the URLs that were not discarded
        return !isDiscarded;
    }

....

I know I can call controller.shutdown() and start everything again, but that is rather slow.

Answer 1

There is no built-in functionality for achieving this without modifying the original source code (by forking it or by using the Reflection API).

Every WebCrawler obtains new URLs via a Frontier instance, which stores the current (discovered and not yet fetched) URLs for all web crawlers. Sadly, this variable has private access in WebCrawler.
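To illustrate the Reflection route, here is a minimal sketch of reading a private field from outside its class. It does not use crawler4j itself: StubFrontier and StubCrawler are hypothetical stand-ins, and the field name "frontier" is an assumption you would need to check against the WebCrawler source of the version you use.

```java
import java.lang.reflect.Field;
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical stand-in for a frontier: a plain queue of pending URLs.
class StubFrontier {
    private final Deque<String> pending = new ArrayDeque<>();

    void schedule(String url) {
        pending.addLast(url);
    }

    int size() {
        return pending.size();
    }
}

// Hypothetical stand-in for a crawler that keeps its frontier private
// and exposes no getter for it.
class StubCrawler {
    private final StubFrontier frontier;

    StubCrawler(StubFrontier frontier) {
        this.frontier = frontier;
    }
}

class FrontierReflectionSketch {

    // Reads the private "frontier" field. The lookup goes through the class
    // that declares the field, because getDeclaredField does not search
    // superclasses; calling it on a subclass would fail.
    static StubFrontier readFrontier(StubCrawler crawler) {
        try {
            Field field = StubCrawler.class.getDeclaredField("frontier");
            field.setAccessible(true);
            return (StubFrontier) field.get(crawler);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Could not read the frontier field", e);
        }
    }

    public static void main(String[] args) {
        StubFrontier shared = new StubFrontier();
        shared.schedule("http://example.com/a");
        StubCrawler crawler = new StubCrawler(shared);
        System.out.println("Pending URLs seen via reflection: " + readFrontier(crawler).size());
    }
}
```

With the real library you would presumably look the field up on WebCrawler.class rather than on your own subclass. Keep in mind that this kind of access can break without warning when the library is upgraded.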

If you want to remove all current URLs, you need to reset the Frontier object. Without implementing a custom Frontier (see the source code) that offers this functionality, resetting will not be possible.
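As a rough idea of what such a custom frontier could offer, here is a self-contained sketch with a "drop everything and reseed" operation. ResettableFrontier and its methods (schedule, next, resetWithSeed) are hypothetical names, not crawler4j's API, and the queue here lives purely in memory. A real implementation would also have to clear whatever storage the library's Frontier uses behind the scenes.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical frontier shared by all crawler threads, hence synchronized.
class ResettableFrontier {
    private final Deque<String> pending = new ArrayDeque<>();

    // Adds a discovered URL to the end of the queue.
    synchronized void schedule(String url) {
        pending.addLast(url);
    }

    // Hands out up to max pending URLs in FIFO order.
    synchronized List<String> next(int max) {
        List<String> batch = new ArrayList<>();
        while (batch.size() < max && !pending.isEmpty()) {
            batch.add(pending.pollFirst());
        }
        return batch;
    }

    // Drops every pending URL and enqueues the new seed in one step, so no
    // other thread can pick up a stale URL in between.
    // Returns the number of URLs that were dropped.
    synchronized int resetWithSeed(String seed) {
        int dropped = pending.size();
        pending.clear();
        pending.addLast(seed);
        return dropped;
    }

    synchronized int size() {
        return pending.size();
    }
}

class ResettableFrontierSketch {
    public static void main(String[] args) {
        ResettableFrontier frontier = new ResettableFrontier();
        frontier.schedule("http://old.example/1");
        frontier.schedule("http://old.example/2");

        int dropped = frontier.resetWithSeed("http://new_seed.com");
        System.out.println("Dropped " + dropped + " URLs, next batch: " + frontier.next(10));
    }
}
```

Doing the clear and the reseed inside one synchronized method is the main design point: if they were two separate calls, another crawler thread could fetch from the queue between them.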


Solution 1
2 Accepted 2018-01-26 13:10:45
