
Multithreaded DFS for web crawler in Java

I am writing a web crawler in Java using Jsoup.

Currently I have a single-threaded implementation. It only has to crawl one domain, so I could have chosen either DFS or BFS; I went with the queue-based (breadth-first) traversal, because using a queue rather than a stack means I can swap in a LinkedBlockingQueue when I write the multithreaded version.

I have a Queue of links to visit and a HashSet of already-visited links, and my main loop pops a link from the queue, visits the page, and adds any unvisited links from the page to the queue.

This is my single-threaded implementation (if any of the throws declarations are spurious, please let me know why, as I need to get to grips with that):

import java.io.IOException;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.List;
import java.util.concurrent.LinkedBlockingQueue;

import org.jsoup.HttpStatusException;
import org.jsoup.Jsoup;
import org.jsoup.helper.Validate;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// frontier of URLs still to be fetched
private static LinkedBlockingQueue<String> URLSToCrawl = new LinkedBlockingQueue<String>();
private static String baseURL;
private static String HTTPSBaseURL;
// URLs that have already been enqueued/visited, to avoid re-crawling them
private static HashSet<String> alreadyCrawledSet = new HashSet<String>();
// URLs that came back with an HTTP error status
private static List<String> deadLinks = new LinkedList<String>();

public static void main(String[] args) throws IOException, InterruptedException {

    // should output a site map, showing the static assets for each page. 

    Validate.isTrue(args.length == 1, "usage: supply url to fetch");

    baseURL = args[0];
    HTTPSBaseURL = baseURL.replace("http://", "https://");

    alreadyCrawledSet.add(baseURL);
    URLSToCrawl.add(baseURL);

    // main loop: take the next URL and crawl it;
    // crawlURL() adds any newly discovered links back onto the queue
    while (!URLSToCrawl.isEmpty()) {
        String url = URLSToCrawl.take();
        crawlURL(url);
    }
}

private static void crawlURL(String url) throws IOException, InterruptedException {
    print("%s", url); // printf-style logging helper (not shown)
    try {
        Document doc = Jsoup.connect(url).get();
        Elements links = doc.select("a[href]");

        for (Element link : links) {
            // resolve the href against the page URL so we always compare absolute URLs
            String linkURL = link.attr("abs:href");
            if (sameDomain(linkURL) && !alreadyCrawled(linkURL)) {
                alreadyCrawledSet.add(linkURL);
                URLSToCrawl.put(linkURL);
            }
        }
    } catch (HttpStatusException e) {
        // non-2xx response: record the URL as a dead link and move on
        deadLinks.add(url);
    }
}

private static boolean alreadyCrawled(String url) {
    return alreadyCrawledSet.contains(url);
}

I would like to make this multithreaded, to take advantage of the fact that the single-threaded implementation spends most of its time blocked waiting for the HTTP request in the Jsoup.connect(url).get() call to return. I am hoping that by allowing multiple threads to run at once, useful work will get done during this I/O-bound delay and the program will speed up.

I am not very experienced with concurrency - my first thought was to simply create an Executor and submit every call to crawlURL to it. But I am confused - I don't know how to make sure my HashSet and Queue are accessed in a thread-safe manner, especially given that each thread not only consumes URLs from the Queue but also pushes new URLs onto it.
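To make the question concrete, this is the kind of thing I imagine the shared state looking like, based on what I've read about java.util.concurrent - just a sketch, and the CrawlerState class name and offerIfNew helper are placeholders I made up, not part of my real code. As far as I can tell LinkedBlockingQueue is already thread-safe, so it is really the HashSet I am unsure about:

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

public class CrawlerState {
    // LinkedBlockingQueue is documented as thread-safe, so it can be shared as-is
    static final LinkedBlockingQueue<String> urlsToCrawl = new LinkedBlockingQueue<>();

    // a concurrent set backed by ConcurrentHashMap in place of the plain HashSet
    static final Set<String> alreadyCrawled = ConcurrentHashMap.newKeySet();

    // add() returns false if another thread inserted the URL first, which removes
    // the check-then-act race between contains() and add()
    static void offerIfNew(String url) throws InterruptedException {
        if (alreadyCrawled.add(url)) {
            urlsToCrawl.put(url);
        }
    }
}

Is something along those lines the right direction, or am I missing the point?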

I understand the basics of atomicity, and the idea that threads can "lock" shared resources, but I don't know how to put those concepts into practice in this scenario.
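For example, my rough understanding of "locking" is that I could guard the plain HashSet behind a synchronized method, something like the sketch below (VisitedSet and markIfNew are names I just made up) - but I don't know whether that is the right tool here or whether a concurrent collection is preferable:

import java.util.HashSet;
import java.util.Set;

public class VisitedSet {
    private final Set<String> visited = new HashSet<>();

    // synchronized means only one thread at a time can be inside this method,
    // so the HashSet is never read and modified concurrently
    public synchronized boolean markIfNew(String url) {
        return visited.add(url); // false if the URL had already been seen
    }
}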

Does anyone have any advice for making this multithreaded?

My solution was to deal with one layer of the graph at a time: for each level, submit every link to the ExecutorService to be crawled (multithreaded), then wait for that level to complete (using a CountDownLatch) before moving on to the next level.

I used a fixed-size thread pool (Executors.newFixedThreadPool) as a crude form of rate limiting.
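In outline, the per-level loop looked something like the sketch below - this is simplified rather than my exact code: the LevelCrawler name, the pool size of 8 and the crawlURL signature (returning the same-domain links it found) are illustrative, and error handling is left out:

import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class LevelCrawler {
    private static final Set<String> alreadyCrawled = ConcurrentHashMap.newKeySet();
    private static final ExecutorService pool = Executors.newFixedThreadPool(8); // crude rate limit

    public static void crawl(String baseURL) throws InterruptedException {
        Set<String> currentLevel = new HashSet<>();
        currentLevel.add(baseURL);
        alreadyCrawled.add(baseURL);

        while (!currentLevel.isEmpty()) {
            Set<String> nextLevel = ConcurrentHashMap.newKeySet();
            CountDownLatch latch = new CountDownLatch(currentLevel.size());

            for (String url : currentLevel) {
                pool.submit(() -> {
                    try {
                        for (String link : crawlURL(url)) {
                            if (alreadyCrawled.add(link)) { // first thread to add wins
                                nextLevel.add(link);
                            }
                        }
                    } finally {
                        latch.countDown(); // count down even if the fetch failed
                    }
                });
            }

            latch.await();            // block until the whole level has been crawled
            currentLevel = nextLevel; // then move on to the next level
        }
        pool.shutdown();
    }

    private static Set<String> crawlURL(String url) {
        // placeholder: the real version fetches with Jsoup.connect(url).get()
        // and collects the same-domain links, as in the question
        return new HashSet<>();
    }
}

The level-by-level structure is also what gives a natural termination condition: when a level produces no new links, nextLevel is empty and the while loop ends.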

(Initially I tried to just dispatch every URL asynchronously, which would probably be more efficient, but I couldn't figure out how to tell when the whole crawl had finished so that I could shut everything down.)
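For reference, one way to terminate a fully asynchronous version (an untested sketch with made-up names, not something from my actual crawler) is to keep an explicit count of URLs still in flight: increment it before a URL is submitted, decrement it when its task finishes, and release the main thread when it reaches zero:

import java.util.Collections;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

public class AsyncCrawler {
    private static final Set<String> seen = ConcurrentHashMap.newKeySet();
    private static final ExecutorService pool = Executors.newFixedThreadPool(8);
    private static final AtomicInteger pending = new AtomicInteger();
    private static final CountDownLatch finished = new CountDownLatch(1);

    public static void crawl(String baseURL) throws InterruptedException {
        seen.add(baseURL);
        submit(baseURL);
        finished.await(); // released when the last in-flight URL completes
        pool.shutdown();
    }

    private static void submit(String url) {
        pending.incrementAndGet(); // count this URL before it is queued
        pool.submit(() -> {
            try {
                for (String link : fetchLinks(url)) {
                    if (seen.add(link)) {
                        submit(link); // children are counted before the parent finishes
                    }
                }
            } finally {
                if (pending.decrementAndGet() == 0) {
                    finished.countDown(); // nothing queued and nothing running
                }
            }
        });
    }

    private static Set<String> fetchLinks(String url) {
        // placeholder for Jsoup.connect(url).get() plus link extraction
        return Collections.emptySet();
    }
}

Because each new link is counted before its parent's task decrements the counter, the count can only reach zero once no work is queued or running.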
