
Crawl URLs with a certain prefix

I would like to use crawler4j to crawl only URLs that have a certain prefix.

For example, a URL that starts with http://url1.com/timer/image is valid, e.g. http://url1.com/timer/image/text.php.

This URL is not valid: http://test1.com/timer/image

I tried to implement it like this:

public boolean shouldVisit(Page page, WebURL url) {
    String href = url.getURL().toLowerCase();
    String adrs1 = "http://url1.com/timer/image";
    String adrs2 = "http://url2.com/house/image";

    if (!(href.startsWith(adrs1)) || !(href.startsWith(adrs2))) {
        return false;
    }

    if (filters.matcher(href).matches()) {
        return false;
    }

    for (String crawlDomain : myCrawlDomains) {
        if (href.startsWith(crawlDomain)) {
            return true;
        }
    }

    return false;
}

However, this does not seem to work, because the crawler also visits other URLs.

Any recommendations on what I could do?

I appreciate your answer!

Basically, you can keep an array of allowed prefixes for the URLs you want to crawl. Inside your method, traverse the array and return true only if the URL matches one of the allowed prefixes. That means you don't have to list any domains that you don't want to crawl.

public boolean shouldVisit(Page page, WebURL url) {
    String href = url.getURL().toLowerCase();
    // prefixes that you want to crawl
    String[] allowedPrefixes = {"http://url1.com", "http://url2.com"};

    for (String allowedPrefix : allowedPrefixes) {
        if (href.startsWith(allowedPrefix)) {
            return true;
        }
    }

    return false;
}
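
The prefixes above are domain-only, so they would also admit pages outside /timer/image and /house/image. A minimal standalone sketch of the same loop with the full path prefixes from the question follows; the class name PrefixFilter and the method isAllowed are hypothetical, and the check works on a plain String so it can be tried without crawler4j. Inside shouldVisit you would pass url.getURL() to it.

```java
import java.util.Locale;

public class PrefixFilter {
    // Assumed prefixes, copied from the question; adjust them to your own sites.
    private static final String[] ALLOWED_PREFIXES = {
        "http://url1.com/timer/image",
        "http://url2.com/house/image"
    };

    /** Returns true only if the URL starts with one of the allowed prefixes. */
    public static boolean isAllowed(String url) {
        // Lower-case first, as the original code does, so the comparison
        // ignores case differences in the incoming URL.
        String href = url.toLowerCase(Locale.ROOT);
        for (String prefix : ALLOWED_PREFIXES) {
            if (href.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isAllowed("http://url1.com/timer/image/text.php")); // true
        System.out.println(isAllowed("http://test1.com/timer/image"));         // false
    }
}
```

Note that a plain startsWith check also accepts URLs such as http://url1.com/timer/imagery; if that matters, appending a trailing slash to each prefix is one way to tighten it.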

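Your code is not working because your condition is incorrect:

!(href.startsWith(adrs1)) || !(href.startsWith(adrs2))

With ||, this is true whenever the URL fails to start with at least one of the two prefixes. No URL can start with both, so the condition is true for every URL; the two negated checks should be combined with && instead.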
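
The difference between the two conditions can be checked in isolation. This is only a sketch: the class ConditionCheck and its method names are hypothetical, and the two prefixes are the ones from the question.

```java
public class ConditionCheck {
    static final String ADRS1 = "http://url1.com/timer/image";
    static final String ADRS2 = "http://url2.com/house/image";

    // Condition as written in the question: rejects a URL as soon as it
    // misses at least one of the two prefixes.
    static boolean rejectedByOriginal(String href) {
        return !href.startsWith(ADRS1) || !href.startsWith(ADRS2);
    }

    // Corrected condition: rejects a URL only when it matches neither prefix.
    static boolean rejectedByFixed(String href) {
        return !href.startsWith(ADRS1) && !href.startsWith(ADRS2);
    }

    public static void main(String[] args) {
        String good = "http://url1.com/timer/image/text.php";
        String bad = "http://test1.com/timer/image";
        System.out.println(rejectedByOriginal(good)); // true: a valid URL is rejected
        System.out.println(rejectedByFixed(good));    // false
        System.out.println(rejectedByFixed(bad));     // true
    }
}
```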

Another reason might be that you have not configured crawlerDomains. It is configured during startup of your application by calling CrawlController#setCustomData(crawler1Domains);

Look at the sample source code of crawler4j; crawlerDomains are set here: MultipleCrawlerController.java#79

Look at the code below. It may help you.

public boolean shouldVisit(Page page, WebURL url) {
   String href = url.getURL().toLowerCase();
   String adrs1 = "http://url1.com/timer/image";
   String adrs2 = "http://url2.com/house/image";
   return !FILTERS.matcher(href).matches() && (href.startsWith(adrs1) || href.startsWith(adrs2));
}
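
The snippet above does not show how FILTERS is defined. A self-contained sketch of the same check follows, assuming FILTERS is a file-extension blacklist similar to the one used in crawler4j's example crawlers; the exact extension list, the class name UrlFilter, and the String-based signature are assumptions made so the logic can be run without crawler4j.

```java
import java.util.Locale;
import java.util.regex.Pattern;

public class UrlFilter {
    // Assumed pattern: skip URLs that end in a static or binary file extension.
    private static final Pattern FILTERS = Pattern.compile(
            ".*(\\.(css|js|bmp|gif|jpe?g|png|tiff?|mp3|mp4|zip|gz))$");

    private static final String ADRS1 = "http://url1.com/timer/image";
    private static final String ADRS2 = "http://url2.com/house/image";

    /** Visit a URL only if it is not filtered out and has an allowed prefix. */
    public static boolean shouldVisit(String url) {
        String href = url.toLowerCase(Locale.ROOT);
        return !FILTERS.matcher(href).matches()
                && (href.startsWith(ADRS1) || href.startsWith(ADRS2));
    }

    public static void main(String[] args) {
        System.out.println(shouldVisit("http://url1.com/timer/image/text.php")); // true
        System.out.println(shouldVisit("http://url1.com/timer/image/logo.png")); // false
        System.out.println(shouldVisit("http://test1.com/timer/image"));         // false
    }
}
```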
