
Multithreaded DFS for web crawler in Java

I am writing a web crawler in Java using Jsoup.

Currently I have a single-threaded implementation. It only has to crawl one domain, so I could have chosen either DFS or BFS; I went with the queue-based (breadth-first) traversal, because using a queue rather than a stack means I can swap in a LinkedBlockingQueue when I write the multithreaded version.

I have a Queue of links to visit and a HashSet of already-visited links, and my main loop pops a link from the queue, visits the page, and adds any unvisited links from the page to the queue.

This is my single-threaded implementation (if any of the throws declarations are spurious, please let me know why, as I need to get to grips with that):

import java.io.IOException;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.List;
import java.util.concurrent.LinkedBlockingQueue;

import org.jsoup.HttpStatusException;
import org.jsoup.Jsoup;
import org.jsoup.helper.Validate;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// frontier of URLs still to be fetched
private static LinkedBlockingQueue<String> URLSToCrawl = new LinkedBlockingQueue<String>();
private static String baseURL;
private static String HTTPSBaseURL;
// URLs that have already been enqueued/visited, to avoid re-crawling them
private static HashSet<String> alreadyCrawledSet = new HashSet<String>();
// URLs that came back with an HTTP error status
private static List<String> deadLinks = new LinkedList<String>();

public static void main(String[] args) throws IOException, InterruptedException {

    // should output a site map, showing the static assets for each page. 

    Validate.isTrue(args.length == 1, "usage: supply url to fetch");

    baseURL = args[0];
    HTTPSBaseURL = baseURL.replace("http://", "https://");

    alreadyCrawledSet.add(baseURL);
    URLSToCrawl.add(baseURL);

    // main loop: take the next URL and crawl it;
    // crawlURL() adds any newly discovered links back onto the queue
    while (!URLSToCrawl.isEmpty()) {
        String url = URLSToCrawl.take();
        crawlURL(url);
    }
}

private static void crawlURL(String url) throws IOException, InterruptedException {
    print("%s", url); // printf-style logging helper (not shown)
    try {
        Document doc = Jsoup.connect(url).get();
        Elements links = doc.select("a[href]");

        for (Element link : links) {
            // resolve the href against the page URL so we always compare absolute URLs
            String linkURL = link.attr("abs:href");
            if (sameDomain(linkURL) && !alreadyCrawled(linkURL)) {
                alreadyCrawledSet.add(linkURL);
                URLSToCrawl.put(linkURL);
            }
        }
    } catch (HttpStatusException e) {
        // non-2xx response: record the URL as a dead link and move on
        deadLinks.add(url);
    }
}

private static boolean alreadyCrawled(String url) {
    return alreadyCrawledSet.contains(url);
}

I would like to make this multithreaded, to take advantage of the fact that the single-threaded implementation spends most of its time blocked waiting for the HTTP request in the Jsoup.connect(url).get() call to return. I am hoping that by allowing multiple threads to run at once, useful work will get done during this I/O-bound delay and the program will speed up.

I am not very experienced with concurrency - my first thought was to simply create an Executor and submit every call to crawlURL to it. But I am confused - I don't know how to make sure my HashSet and Queue are accessed in a thread-safe manner, especially given that each thread not only consumes URLs from the Queue but also pushes new URLs onto it.
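To make the question concrete, this is the kind of thing I imagine the shared state looking like, based on what I've read about java.util.concurrent - just a sketch, and the CrawlerState class name and offerIfNew helper are placeholders I made up, not part of my real code. As far as I can tell LinkedBlockingQueue is already thread-safe, so it is really the HashSet I am unsure about:

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

public class CrawlerState {
    // LinkedBlockingQueue is documented as thread-safe, so it can be shared as-is
    static final LinkedBlockingQueue<String> urlsToCrawl = new LinkedBlockingQueue<>();

    // a concurrent set backed by ConcurrentHashMap in place of the plain HashSet
    static final Set<String> alreadyCrawled = ConcurrentHashMap.newKeySet();

    // add() returns false if another thread inserted the URL first, which removes
    // the check-then-act race between contains() and add()
    static void offerIfNew(String url) throws InterruptedException {
        if (alreadyCrawled.add(url)) {
            urlsToCrawl.put(url);
        }
    }
}

Is something along those lines the right direction, or am I missing the point?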

I understand the basics of atomicity, and the idea that threads can "lock" shared resources, but I don't know how to put those concepts into practice in this scenario.
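For example, my rough understanding of "locking" is that I could guard the plain HashSet behind a synchronized method, something like the sketch below (VisitedSet and markIfNew are names I just made up) - but I don't know whether that is the right tool here or whether a concurrent collection is preferable:

import java.util.HashSet;
import java.util.Set;

public class VisitedSet {
    private final Set<String> visited = new HashSet<>();

    // synchronized means only one thread at a time can be inside this method,
    // so the HashSet is never read and modified concurrently
    public synchronized boolean markIfNew(String url) {
        return visited.add(url); // false if the URL had already been seen
    }
}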

Does anyone have any advice for making this multithreaded?

My solution was to deal with one layer of the graph at a time: for each level, submit every link to the ExecutorService to be crawled (multithreaded), then wait for that level to complete (using a CountDownLatch) before moving on to the next level.

I used a fixed-size thread pool (Executors.newFixedThreadPool) as a crude form of rate limiting.
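In outline, the per-level loop looked something like the sketch below - this is simplified rather than my exact code: the LevelCrawler name, the pool size of 8 and the crawlURL signature (returning the same-domain links it found) are illustrative, and error handling is left out:

import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class LevelCrawler {
    private static final Set<String> alreadyCrawled = ConcurrentHashMap.newKeySet();
    private static final ExecutorService pool = Executors.newFixedThreadPool(8); // crude rate limit

    public static void crawl(String baseURL) throws InterruptedException {
        Set<String> currentLevel = new HashSet<>();
        currentLevel.add(baseURL);
        alreadyCrawled.add(baseURL);

        while (!currentLevel.isEmpty()) {
            Set<String> nextLevel = ConcurrentHashMap.newKeySet();
            CountDownLatch latch = new CountDownLatch(currentLevel.size());

            for (String url : currentLevel) {
                pool.submit(() -> {
                    try {
                        for (String link : crawlURL(url)) {
                            if (alreadyCrawled.add(link)) { // first thread to add wins
                                nextLevel.add(link);
                            }
                        }
                    } finally {
                        latch.countDown(); // count down even if the fetch failed
                    }
                });
            }

            latch.await();            // block until the whole level has been crawled
            currentLevel = nextLevel; // then move on to the next level
        }
        pool.shutdown();
    }

    private static Set<String> crawlURL(String url) {
        // placeholder: the real version fetches with Jsoup.connect(url).get()
        // and collects the same-domain links, as in the question
        return new HashSet<>();
    }
}

The level-by-level structure is also what gives a natural termination condition: when a level produces no new links, nextLevel is empty and the while loop ends.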

(Initially I tried to just dispatch every URL asynchronously, which would probably be more efficient, but I couldn't figure out how to tell when the whole crawl had finished so that I could shut everything down.)
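For reference, one way to terminate a fully asynchronous version (an untested sketch with made-up names, not something from my actual crawler) is to keep an explicit count of URLs still in flight: increment it before a URL is submitted, decrement it when its task finishes, and release the main thread when it reaches zero:

import java.util.Collections;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

public class AsyncCrawler {
    private static final Set<String> seen = ConcurrentHashMap.newKeySet();
    private static final ExecutorService pool = Executors.newFixedThreadPool(8);
    private static final AtomicInteger pending = new AtomicInteger();
    private static final CountDownLatch finished = new CountDownLatch(1);

    public static void crawl(String baseURL) throws InterruptedException {
        seen.add(baseURL);
        submit(baseURL);
        finished.await(); // released when the last in-flight URL completes
        pool.shutdown();
    }

    private static void submit(String url) {
        pending.incrementAndGet(); // count this URL before it is queued
        pool.submit(() -> {
            try {
                for (String link : fetchLinks(url)) {
                    if (seen.add(link)) {
                        submit(link); // children are counted before the parent finishes
                    }
                }
            } finally {
                if (pending.decrementAndGet() == 0) {
                    finished.countDown(); // nothing queued and nothing running
                }
            }
        });
    }

    private static Set<String> fetchLinks(String url) {
        // placeholder for Jsoup.connect(url).get() plus link extraction
        return Collections.emptySet();
    }
}

Because each new link is counted before its parent's task decrements the counter, the count can only reach zero once no work is queued or running.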
