How to set depth of simple JAVA web crawler

I wrote a simple recursive web crawler to fetch just the URL links from a web page.

Now I am trying to figure out a way to limit the crawler by depth, but I am not sure how to do that (I can limit the crawler to the top N links, but I want to limit it by depth).

For example: depth 2 should fetch the parent's links -> the children's links -> the children's children's links.

Any input is appreciated.

    import java.io.IOException;
    import java.net.URL;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    public class SimpleCrawler {

        // URLs that have already been crawled
        static Map<String, String> retMap = new ConcurrentHashMap<String, String>();

        public static void main(String args[]) throws IOException {
            StringBuffer sb = new StringBuffer();
            Map<String, String> map = returnURL("http://www.google.com");
            recursiveCrawl(map);
            for (Map.Entry<String, String> entry : retMap.entrySet()) {
                sb.append(entry.getKey());
            }
        }

        public static void recursiveCrawl(Map<String, String> map)
                throws IOException {
            for (Map.Entry<String, String> entry : map.entrySet()) {
                String key = entry.getKey();
                Map<String, String> recurSive = returnURL(key);
                recursiveCrawl(recurSive);
            }
        }

        public synchronized static Map<String, String> returnURL(String URL)
                throws IOException {
            Map<String, String> tempMap = new HashMap<String, String>();
            Document doc = null;
            if (URL != null && !URL.equals("") && !retMap.containsKey(URL)) {
                System.out.println("Processing==>" + URL);
                try {
                    new URL(URL); // validates the URL; throws MalformedURLException if malformed
                    System.setProperty("http.proxyHost", "proxy");
                    System.setProperty("http.proxyPort", "port");
                    doc = Jsoup.connect(URL).get();
                    if (doc != null) {
                        Elements links = doc.select("a");
                        String FinalString = "";
                        for (Element e : links) {
                            // note: prefixing "http:" only fixes protocol-relative
                            // links; e.absUrl("href") would resolve all relative links
                            FinalString = "http:" + e.attr("href");
                            if (!retMap.containsKey(FinalString)) {
                                tempMap.put(FinalString, FinalString);
                            }
                        }
                    }
                } catch (Exception e) {
                    e.printStackTrace();
                }
                retMap.put(URL, URL);
            } else {
                System.out.println("****Skipping URL****" + URL);
            }
            return tempMap;
        }
    }

EDIT1:

I thought of using a worklist and modified the code accordingly. I am still not sure how to set the depth here either (I can set the number of web pages to crawl, but not the depth). Any suggestions would be appreciated.

public void startCrawl(String url) {
    while (this.pagesVisited.size() < 2) { // limits by number of pages, not by depth
        String currentUrl;
        SpiderLeg leg = new SpiderLeg();
        if (this.pagesToVisit.isEmpty()) {
            currentUrl = url;
            this.pagesVisited.add(url);
        } else {
            currentUrl = this.nextUrl();
        }
        leg.crawl(currentUrl);
        System.out.println("pagesToVisit Size" + pagesToVisit.size());
        // SpiderLeg collected the links found on the page
        this.pagesToVisit.addAll(leg.getLinks());
    }
    System.out.println("\n**Done** Visited " + this.pagesVisited.size()
            + " web page(s)");
}

Based on the non-recursive approach:

Keep a worklist of URLs, pagesToCrawl, of type CrawlURL:

class CrawlURL {
  public String url;
  public int depth;

  public CrawlURL(String url, int depth) {
    this.url = url;
    this.depth = depth;
  }
}

Initially (before entering the loop):

Queue<CrawlURL> pagesToCrawl = new LinkedList<>();
pagesToCrawl.add(new CrawlURL(rootUrl, 0)); //rootUrl is the url to start from

Now the loop:

while (!pagesToCrawl.isEmpty()) { // will proceed at least once (for rootUrl)
  CrawlURL currentUrl = pagesToCrawl.remove();
  // analyze the url here, then
  // update the worklist with the crawled links (see below)
}

And the update with the crawled links:

if (currentUrl.depth < 2) {
  for (String url : leg.getLinks()) { // referring to your analysis result
    pagesToCrawl.add(new CrawlURL(url, currentUrl.depth + 1));
  }
}
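Putting the snippets together, a minimal sketch of the whole loop could look like this (the maxDepth parameter and the visited set are additions here; SpiderLeg is the helper class from your edit):

public void startCrawl(String rootUrl, int maxDepth) {
    Queue<CrawlURL> pagesToCrawl = new LinkedList<>();
    Set<String> visited = new HashSet<>();
    pagesToCrawl.add(new CrawlURL(rootUrl, 0));

    while (!pagesToCrawl.isEmpty()) {
        CrawlURL current = pagesToCrawl.remove();
        if (!visited.add(current.url)) {
            continue; // already crawled this url
        }
        SpiderLeg leg = new SpiderLeg();
        leg.crawl(current.url); // analyze the url
        if (current.depth < maxDepth) { // only expand links below the depth limit
            for (String link : leg.getLinks()) {
                pagesToCrawl.add(new CrawlURL(link, current.depth + 1));
            }
        }
    }
    System.out.println("**Done** Visited " + visited.size() + " web page(s)");
}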

You could enhance CrawlURL with other metadata (e.g. link name, referrer, etc.).
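A sketch of that (the extra fields are just illustrative):

class CrawlURL {
  public String url;
  public int depth;
  public String referrer; // page on which this link was found
  public String linkText; // anchor text of the link

  public CrawlURL(String url, int depth, String referrer, String linkText) {
    this.url = url;
    this.depth = depth;
    this.referrer = referrer;
    this.linkText = linkText;
  }
}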

Alternative: In my comment above I mentioned a generational approach. It's a bit more complex than this one. The basic idea is to keep two lists (currentPagesToCrawl and futurePagesToCrawl) together with a generation variable (starting at 0 and increasing every time currentPagesToCrawl becomes empty). All newly crawled URLs are put into the futurePagesToCrawl queue, and when currentPagesToCrawl becomes empty, the two lists are switched. This is done until the generation variable reaches 2.
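A minimal sketch of that idea, assuming a hypothetical helper linksFoundOn(url) that does the actual page analysis and returns the links it found:

Queue<String> currentPagesToCrawl = new LinkedList<>();
Queue<String> futurePagesToCrawl = new LinkedList<>();
int generation = 0;

currentPagesToCrawl.add(rootUrl);
while (generation < 2 && !currentPagesToCrawl.isEmpty()) {
    String url = currentPagesToCrawl.remove();
    // linksFoundOn is a placeholder for your page analysis
    futurePagesToCrawl.addAll(linksFoundOn(url)); // crawled links go to the next generation

    if (currentPagesToCrawl.isEmpty()) { // generation exhausted: switch the lists
        Queue<String> tmp = currentPagesToCrawl;
        currentPagesToCrawl = futurePagesToCrawl;
        futurePagesToCrawl = tmp;
        generation++;
    }
}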

You could add a depth parameter to the signature of your recursive method, e.g.

In your main:

recursiveCrawl(map,0);

and

public static void recursiveCrawl(Map<String, String> map, int depth) throws IOException {
    if (depth++ < DESIRED_DEPTH) { // assuming initial depth = 0
        for (Map.Entry<String, String> entry : map.entrySet()) {
            String key = entry.getKey();
            Map<String, String> recurSive = returnURL(key);
            recursiveCrawl(recurSive, depth);
        }
    }
}

You can do something like this:

static int maxLevels = 10;

public static void main(String args[]) throws IOException {
     ...
     recursiveCrawl(map,0);
     ...
}

public static void recursiveCrawl(Map<String, String> map, int level) throws IOException {
    for (Map.Entry<String, String> entry : map.entrySet()) {
        String key = entry.getKey();
        Map<String, String> recurSive = returnURL(key);
        if (level < maxLevels) {
recursiveCrawl(recurSive, level + 1); // level + 1 rather than ++level, so sibling iterations keep the same level
        }
    }
}

Also, you can use a Set instead of a Map.
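For example, retMap only ever maps each URL to itself, so a Set states the intent more directly (a sketch, assuming Java 8+ for the concurrent set):

static Set<String> visited = ConcurrentHashMap.newKeySet(); // thread-safe Set

// instead of retMap.containsKey(URL) followed by retMap.put(URL, URL):
if (visited.add(url)) {
    // add() returns true only the first time: process the url here
}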
