Multithreaded DFS for web crawler in Java

I am writing a web crawler in Java using Jsoup.

Currently I have a single-threaded implementation that uses breadth-first search (it only has to crawl one domain, so I could have chosen either DFS or BFS; I opted for BFS because it meant I could use a queue instead of a stack, and therefore a LinkedBlockingQueue when I do a multithreaded version).

I have a Queue of links to visit and a HashSet of already-visited links, and my main loop pops a link from the queue, visits the page, and adds any unvisited links from the page to the queue.

These are the contents of my class containing the single-threaded implementation (if any of the throws declarations are spurious, please let me know why, as I need to get to grips with that):

import java.io.IOException;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.List;
import java.util.concurrent.LinkedBlockingQueue;

import org.jsoup.HttpStatusException;
import org.jsoup.Jsoup;
import org.jsoup.helper.Validate;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// URLs waiting to be crawled, URLs already seen, and links that returned errors.
private static LinkedBlockingQueue<String> URLSToCrawl = new LinkedBlockingQueue<String>();
private static String baseURL;
private static String HTTPSBaseURL;
private static HashSet<String> alreadyCrawledSet = new HashSet<String>();
private static List<String> deadLinks = new LinkedList<String>();

public static void main(String[] args) throws IOException, InterruptedException {

    // should output a site map, showing the static assets for each page. 

    Validate.isTrue(args.length == 1, "usage: supply url to fetch");

    baseURL = args[0];
    HTTPSBaseURL = baseURL.replace("http://", "https://");

    // Seed the crawl with the base URL.
    alreadyCrawledSet.add(baseURL);
    URLSToCrawl.add(baseURL);

    // Single-threaded crawl loop: take URLs until the queue is drained.
    while (!URLSToCrawl.isEmpty()) {
        String url = URLSToCrawl.take();
        crawlURL(url);
    }
}

private static void crawlURL(String url) throws IOException, InterruptedException {
    print("%s", url);
    try {
        Document doc = Jsoup.connect(url).get();
        Elements links = doc.select("a[href]");

        for (Element link : links) {
            String linkURL = link.attr("abs:href"); // resolve each href to an absolute URL
            // sameDomain (helper not shown here) checks the link against baseURL / HTTPSBaseURL
            if (sameDomain(linkURL) && !alreadyCrawled(linkURL)) {
                alreadyCrawledSet.add(linkURL);
                URLSToCrawl.put(linkURL);
            }
        }
    } catch (HttpStatusException e) {
        // A non-OK HTTP status means a dead link: record it and carry on.
        deadLinks.add(url);
    }
}

private static boolean alreadyCrawled(String url) {
    return alreadyCrawledSet.contains(url);
}

I would like to make this multithreaded, to take advantage of the fact that the single-threaded implementation has to wait for the HTTP request in the Jsoup.connect(url).get() call to return before it can continue processing. I am hoping that by allowing multiple threads to act at once, some work will get done during this I/O-bound delay, speeding up the program.

I am not very experienced with concurrency. My first thought was to simply create an Executor and submit every call to crawlURL to it. But I am confused: I don't know how to make sure my HashSet and Queue are accessed in a thread-safe manner, especially given that each thread not only consumes URLs from the Queue but also pushes new URLs onto the Queue.

I understand the basics of atomicity, and the idea that threads can "lock" shared resources, but I don't know how to put those ideas into practice in this scenario.
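For illustration, here is a minimal sketch of what thread-safe replacements for the two shared collections could look like (assuming Java 8+ for ConcurrentHashMap.newKeySet(); the wrapper class and its method names are hypothetical, not from the code above):

import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical wrapper around the shared crawl state.
class SharedCrawlState {
    // LinkedBlockingQueue is already safe for concurrent put/take.
    private final BlockingQueue<String> urlsToCrawl = new LinkedBlockingQueue<>();
    // A concurrent set (backed by ConcurrentHashMap) replaces the plain HashSet.
    private final Set<String> alreadyCrawled = ConcurrentHashMap.newKeySet();

    // add() returns false when another thread got there first, so the
    // racy "contains, then add" pair collapses into one atomic step.
    void offerIfNew(String url) throws InterruptedException {
        if (alreadyCrawled.add(url)) {
            urlsToCrawl.put(url);
        }
    }

    String next() throws InterruptedException {
        return urlsToCrawl.take();
    }
}

The key point is that Set.add() on a concurrent set reports whether the element was newly added, so no explicit lock is needed around the visited check.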

Does anyone have any advice for making this multithreaded? 有没有人建议使这种多线程?

My solution was to deal with one layer of the graph at a time: for each level, submit every link to the ExecutorService to be crawled (multithreaded), but then wait for that level to complete (using a CountDownLatch) before moving on to the next level.

I used a FixedThreadPool as a form of rate limiting.

(Initially I tried to just dispatch every URL asynchronously, which ought to be more efficient, but I couldn't figure out how to terminate the whole thing.)
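A sketch of that level-by-level approach, assuming a fetchLinks(url) helper that does the Jsoup fetch and returns the same-domain links on a page (the class name, helper name, and pool size here are illustrative):

import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class LevelByLevelCrawler {
    // The fixed pool size doubles as a crude rate limit on concurrent requests.
    private final ExecutorService pool = Executors.newFixedThreadPool(8);
    private final Set<String> visited = ConcurrentHashMap.newKeySet();

    void crawl(String seed) throws InterruptedException {
        List<String> currentLevel = new ArrayList<>();
        currentLevel.add(seed);
        visited.add(seed);

        while (!currentLevel.isEmpty()) {
            CountDownLatch latch = new CountDownLatch(currentLevel.size());
            Queue<String> nextLevel = new ConcurrentLinkedQueue<>();

            for (String url : currentLevel) {
                pool.submit(() -> {
                    try {
                        for (String link : fetchLinks(url)) { // hypothetical Jsoup fetch
                            if (visited.add(link)) {          // atomic check-and-add
                                nextLevel.add(link);
                            }
                        }
                    } finally {
                        latch.countDown(); // count down even if the fetch failed
                    }
                });
            }

            latch.await(); // block until the whole level has been crawled
            currentLevel = new ArrayList<>(nextLevel);
        }
        pool.shutdown();
    }

    // Placeholder: fetch the page with Jsoup and collect its same-domain links.
    private List<String> fetchLinks(String url) {
        return new ArrayList<>();
    }
}

Because countDown() sits in a finally block, a single failed request cannot leave await() hanging, and the loop terminates naturally once a level produces no new links.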
