
Crawl a list of sites using Crawler4j

I have a problem loading a list of links; these links should be passed to controller.addSeed in a loop. Here is the code:

SelectorString selector = new SelectorString();
List <String> lista = new ArrayList<>();
lista=selector.leggiFile();
String crawlStorageFolder = "/home/usersstage/Desktop/prova";
for(String x : lista){
    System.out.println(x);
    System.out.println("----");
}

// numberOfCrawlers is the number of threads started for the
// crawl

int numberOfCrawlers = 2; // threads
CrawlConfig config = new CrawlConfig();
config.setCrawlStorageFolder(crawlStorageFolder);

// Do not send more than one request per second (1000 ms or 200
// ms?)
config.setPolitenessDelay(200);

// crawl depth; -1 for unlimited
config.setMaxDepthOfCrawling(-1);

// maximum number of pages to crawl; -1 for unlimited
config.setMaxPagesToFetch(-1);

config.setResumableCrawling(false);

// controller instance for this crawl
PageFetcher pageFetcher = new PageFetcher(config);
RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig,
        pageFetcher);
CrawlController controller = new CrawlController(config, pageFetcher,
        robotstxtServer);
// LOOP used to add several websites (more than 100)
for(int i=0;i<lista.size();i++){
    controller.addSeed(lista.get(i).toString());    
}
controller.start(Crawler.class, numberOfCrawlers);

I need to crawl these sites and retrieve only RSS pages, but the output of the crawled list is empty.

The code you posted shows how to configure the CrawlController. But if you only need to crawl RSS resources, you need to configure the Crawler itself. The logic belongs in the 'shouldVisit' method on the crawler. Check this example.
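As a rough illustration of the kind of check that could sit behind shouldVisit, here is a minimal sketch of a URL heuristic. The class name RssUrlFilter, the method looksLikeFeed, and the pattern itself are assumptions made for this example; they are not part of crawler4j or of the question's code. The check only guesses from the URL (it would also accept something like sitemap.xml), so it is a starting point rather than a reliable feed detector.

```java
import java.util.Locale;
import java.util.regex.Pattern;

/**
 * Hypothetical helper (not part of crawler4j): guesses from the URL alone
 * whether a link points to an RSS/Atom feed.
 */
public final class RssUrlFilter {

    // Heuristic only: accepts URLs whose path ends in .rss, .xml or .atom,
    // or that contain a /rss, /feed, /rsss? or /feeds path segment,
    // optionally followed by a query string.
    private static final Pattern FEED_PATTERN =
            Pattern.compile(".*(\\.(rss|xml|atom)|/(rss|feed)s?(/.*)?)(\\?.*)?$");

    private RssUrlFilter() {
        // static utility class, no instances
    }

    /** Returns true if the URL looks like a feed according to the heuristic above. */
    public static boolean looksLikeFeed(String url) {
        if (url == null) {
            return false;
        }
        return FEED_PATTERN.matcher(url.toLowerCase(Locale.ROOT)).matches();
    }
}

// One possible wiring inside the question's Crawler class, assuming it
// extends crawler4j's WebCrawler and uses the 4.x signature
// shouldVisit(Page, WebURL) (older releases take only a WebURL):
//
//   @Override
//   public boolean shouldVisit(Page referringPage, WebURL url) {
//       return RssUrlFilter.looksLikeFeed(url.getURL());
//   }
```

With a filter this strict, only feed-looking links found on already fetched pages would be followed; if the feeds sit deeper in a site, a looser shouldVisit (for example, staying within the seed domains) combined with the same check inside visit may be the better fit.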

Try the code below, and check the shouldVisit method in your crawler class.

for(int i=0;i<lista.size();i++){
    controller.addSeed(lista.get(i).toString()); 
    controller.start(Crawler.class, numberOfCrawlers);   
}
