
Web crawling with breadth but not depth

I'm making my first web crawler using Java and Jsoup. I found a piece of code that works, but not the way I want. The problem is that it focuses on the depth of links, but I want to crawl pages breadth-first. I've spent some time trying to rework the code to focus on breadth, but it still goes too deep starting from the first link. Any ideas on how I can do breadth-first crawling?

import java.io.IOException;
import java.util.HashSet;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class WebCrawlerWithDepth {
    private static final int MAX_DEPTH = 4;
    private HashSet<String> links;

    public WebCrawlerWithDepth() {
        links = new HashSet<>();
    }

    public void getPageLinks(String URL, int depth) {
        if (!links.contains(URL) && depth < MAX_DEPTH) {
            System.out.println("Depth: " + depth + " " + URL);
            links.add(URL);
            try {
                Document document = Jsoup.connect(URL).get();
                Elements linksOnPage = document.select("a[href]");

                depth++;
                for (Element page : linksOnPage) {
                    // Recursing into each link immediately is what makes this crawl depth-first.
                    getPageLinks(page.attr("abs:href"), depth);
                }
            } catch (IOException e) {
                System.err.println("Failed to fetch " + URL + ": " + e.getMessage());
            }
        }
    }
}

Basically, the same way you go from depth-first to breadth-first in algorithmic coding: you need a queue.

Add every link you've extracted to the queue, and retrieve new pages to be crawled from that queue.

Here's my take on your code:

import java.io.IOException;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.Queue;
import java.util.Set;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class WebCrawlerWithDepth {

    private static final int MAX_DEPTH = 4;
    private Set<String> visitedLinks;
    private Queue<Link> remainingLinks;

    public WebCrawlerWithDepth() {
        visitedLinks = new HashSet<>();
        remainingLinks = new LinkedList<>();
    }

    public void getPageLinks(String url, int depth) throws IOException {
        // Seed the queue with the start URL at level 0, then process breadth-first.
        visitedLinks.add(url);
        remainingLinks.add(new Link(url, 0));
        int maxDepth = Math.max(1, Math.min(depth, MAX_DEPTH));
        processLinks(maxDepth);
    }

    private void processLinks(final int maxDepth) throws IOException {
        while (!remainingLinks.isEmpty()) {
            Link link = remainingLinks.poll();
            int depth = link.level;
            if (depth < maxDepth) {
                Document document = Jsoup.connect(link.url).get();
                Elements linksOnPage = document.select("a[href]");
                for (Element page : linksOnPage) {
                    // Use the absolute URL so queued links can be fetched directly.
                    String href = page.attr("abs:href");
                    // add() returns false for links already seen, so each URL is queued only once.
                    if (visitedLinks.add(href)) {
                        remainingLinks.offer(new Link(href, depth + 1));
                    }
                }
            }
        }
    }

    // Pairs a URL with the depth (level) at which it was discovered.
    static class Link {

        final String url;
        final int level;

        Link(final String url, final int level) {
            this.url = url;
            this.level = level;
        }
    }
}

Instead of iterating directly over the links in the current page, you need to store them in a Queue. This queue stores all the links to visit from all pages. Then you take the next link to visit from the Queue.
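
For reference, a minimal sketch of how this crawler might be started; the seed URL and depth limit below are placeholders, not part of the original answer:

public static void main(String[] args) throws IOException {
    WebCrawlerWithDepth crawler = new WebCrawlerWithDepth();
    // Placeholder seed URL and depth limit; replace with the site you actually want to crawl.
    crawler.getPageLinks("https://example.com", 3);
}

With the queue in place, pages are visited level by level: every link found on the seed page is fetched before any link that is two hops away.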
