Is there a way to clear the to-visit queue in crawler4j during crawling?

I am trying to find a way to change the seed while a crawl is running and to completely clear the "to visit" database/queue.

In particular, I would like to remove all URLs currently in the queue and add a new seed. Something along the lines of:

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler {

    // Number of URLs discarded since the last seed change
    private int discarded = 0;

    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        boolean isDiscarded = checkPage(referringPage, url);
        if (isDiscarded) {
            this.discarded++;
            if (discarded >= 100) {
                // Clear all the URLs that need to be visited
                ?_____?
                // Add the new seed
                this.myController.addSeed("http://new_seed.com");
                discarded = 0;
            }
        }
        return isDiscarded;
    }

....

I know I can call controller.shutdown() and start everything again, but that is rather slow.

There is no built-in functionality for this. You would have to either modify the original source code (by forking it) or work around it with the Reflection API.

Every WebCrawler obtains new URLs from a Frontier instance, which stores the pending URLs (discovered but not yet fetched) for all crawlers. Unfortunately, the field holding it has private access in WebCrawler.
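
To illustrate the Reflection route in general terms, here is a minimal sketch of reading a private field from outside its class. StandInCrawler and its field name "frontier" are hypothetical stand-ins written for this example, not crawler4j classes; the field name and type in the real WebCrawler should be checked against the version you use.

```java
import java.lang.reflect.Field;
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical stand-in for a crawler that keeps its URL queue in a private field.
class StandInCrawler {
    private final Queue<String> frontier = new ArrayDeque<>();

    StandInCrawler(String... urls) {
        for (String url : urls) {
            frontier.add(url);
        }
    }
}

class PrivateFieldAccess {
    // Reads a private field declared on declaringClass from the given target object.
    // Reflection failures are rethrown as an unchecked exception.
    static Object readPrivateField(Class<?> declaringClass, Object target, String fieldName) {
        try {
            Field field = declaringClass.getDeclaredField(fieldName);
            field.setAccessible(true); // bypass the private modifier
            return field.get(target);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot read field " + fieldName, e);
        }
    }

    public static void main(String[] args) {
        StandInCrawler crawler = new StandInCrawler("http://a.example", "http://b.example");

        @SuppressWarnings("unchecked")
        Queue<String> queue =
                (Queue<String>) readPrivateField(StandInCrawler.class, crawler, "frontier");

        queue.clear();                    // drop everything still pending
        queue.add("http://new_seed.com"); // schedule the new seed
        System.out.println(queue);
    }
}
```

The declaring class is passed explicitly because getDeclaredField does not search superclasses: from a subclass such as MyCrawler, the lookup would have to name the parent class that actually declares the field. This approach is brittle, since it breaks silently if the field is renamed in a later release.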

If you want to remove all current URLs, you need to reset the Frontier object. That is only possible with a custom Frontier implementation that offers this functionality (see the source code).
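
As a rough idea of what such a reset could look like, here is a simplified in-memory sketch. ResettableFrontier and its methods schedule, nextBatch and resetTo are hypothetical names invented for this example; this is not crawler4j's Frontier, which is backed by a database and would need its own handling.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Simplified, in-memory stand-in for a crawl frontier; NOT crawler4j's Frontier class.
class ResettableFrontier {
    private final Queue<String> pending = new ArrayDeque<>();

    // Adds a URL to the end of the queue.
    synchronized void schedule(String url) {
        pending.add(url);
    }

    // Removes and returns up to max URLs for a crawler thread to fetch.
    synchronized List<String> nextBatch(int max) {
        List<String> batch = new ArrayList<>();
        while (batch.size() < max && !pending.isEmpty()) {
            batch.add(pending.poll());
        }
        return batch;
    }

    // Drops every pending URL and restarts from a single new seed in one locked step,
    // so no crawler thread can pick up a stale URL in between.
    synchronized void resetTo(String newSeed) {
        pending.clear();
        pending.add(newSeed);
    }

    synchronized int size() {
        return pending.size();
    }
}

class ResettableFrontierDemo {
    public static void main(String[] args) {
        ResettableFrontier frontier = new ResettableFrontier();
        frontier.schedule("http://old.example/a");
        frontier.schedule("http://old.example/b");

        frontier.resetTo("http://new_seed.com");
        System.out.println(frontier.nextBatch(10));
    }
}
```

All methods are synchronized because the frontier is shared by every crawler thread. A real implementation would also have to decide what to do with any record of already-seen URLs and with pages that are in flight when the reset happens.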
