Implementing a multi-threaded web crawler with Java's ReadWriteLocks

I am trying to implement a multi-threaded web crawler using ReadWriteLocks. I have a Callable that calls an API to get page URLs and crawls each one that is not already present in the seen-URLs set.

From the ExecutorService I am using three threads to do the crawl.

The problem is that different threads end up crawling the same URL twice. How can I prevent a thread from reading a URL that has already been visited?

package Threads;

import java.util.*;
import java.util.List;
import java.util.concurrent.*;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class WebCrawler {

    static HashSet<String> seenURL = new HashSet<>();
    List<String> resultVisitedUrls = new ArrayList<>();
    ReadWriteLock lock_http_request = new ReentrantReadWriteLock();
    Lock readLock_http_request = lock_http_request.readLock();
    Lock writeLock_http_request = lock_http_request.writeLock();


    public  boolean contains(String url){
        readLock_http_request.lock();
        try {
            if(!seenURL.contains(url)){
                return false;
            }else{
                return true;
            }
        }finally {
            readLock_http_request.unlock();
        }
    }
    public void addUrlToSeenURLSet(String url){

        writeLock_http_request.lock();
        try{
            seenURL.add(url);

        }finally {
            writeLock_http_request.unlock();
        }
    }

    public List<String> getResultVisitedUrls() {
        return resultVisitedUrls;
    }


    public void crawl(String startUrl, HtmlParser htmlParser, WebCrawler crawler) throws Exception {


        if (!crawler.contains(startUrl)) {
            try {
                crawler.addUrlToSeenURLSet(startUrl);
                List<String> subUrls = htmlParser.getUrls(startUrl);

                resultVisitedUrls.add(startUrl + "  Done by thread - " + Thread.currentThread());

                for (String subUrl : subUrls) {
                    crawl(subUrl, htmlParser, crawler); 
                }
            } catch (Exception ex) {
                throw new Exception("Something went wrong. Method - crawl : " + ex.getMessage());
            }
        }

    }

    public static void main(String[] args) {
        class Crawl implements Callable<List<String>> {
            String startUrl;
            WebCrawler webCrawler;

            public Crawl(String startUrl, WebCrawler webCrawler){
                this.startUrl = startUrl;
                this.webCrawler = webCrawler;
            }

            public List<String> call() {
                HtmlParser htmlParser = new RetrieveURLs();
                List<String> result = new ArrayList<>();
                try {
                    webCrawler.crawl(startUrl, htmlParser, webCrawler);
                    result =  webCrawler.getResultVisitedUrls();
                }catch(Exception ex){
                    System.err.println("Some exception occurred in run() - " + ex.getMessage());
                }
                return result;
            }
        }

        ExecutorService service = Executors.newFixedThreadPool(4);
        try{
            WebCrawler webCrawler = new WebCrawler();
            WebCrawler webCrawler1 = new WebCrawler();

            Future<List<String>> future_1 = service.submit(new Crawl("http://localhost:3001/getUrls/google.com", webCrawler));
            Future<List<String>> future_2 = service.submit(new Crawl("http://localhost:3001/getUrls/google.com", webCrawler1));
            Future<List<String>> future_3 = service.submit(new Crawl("http://localhost:3001/getUrls/google.com", webCrawler1));

            List<String> result_1 = future_1.get();
            List<String> result_2 = future_2.get();
            List<String> result_3 = future_3.get();

            result_1.addAll(result_2);
            result_2.addAll(result_3);
            //Assert.assertEquals(6, result_1.size());
            System.out.println(result_1.size());
            for(String str : result_1){
                System.out.println(str );
            }

        }catch(ExecutionException | InterruptedException ex){

        }finally {
            service.shutdown();
        }

    }
}

The fault lies in the fact that two threads can call contains(url) with the same value and both get false, so both then enter the block and call crawler.addUrlToSeenURLSet(startUrl). This is a classic check-then-act race: the set is only updated after both threads have already passed the contains check.

Rather than using a pair of locks, just use a concurrent set, which is backed by a ConcurrentHashMap and is thread-safe.

private static final Set<String> seenURLs = ConcurrentHashMap.newKeySet(); 

When you use this set you only need to call add, as it returns true on the first insertion and false if the set already contains the same startUrl value that another thread is working on:

if(seenURLs.add(startUrl)) {
   ...
}
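
Applied to the crawl method from the question, the change might look roughly like the sketch below. This is only a sketch: it keeps the HtmlParser interface from the question's code, keeps the set static so all tasks share it, and assumes the result list may also be written by several threads, so it is wrapped in a synchronized list.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class WebCrawler {

    // One concurrent set shared by all threads; add() is atomic, so the
    // check-then-act gap between contains() and addUrlToSeenURLSet() disappears.
    private static final Set<String> seenURLs = ConcurrentHashMap.newKeySet();

    // Several threads append results, so the list needs to be thread-safe too.
    private final List<String> resultVisitedUrls =
            Collections.synchronizedList(new ArrayList<>());

    public List<String> getResultVisitedUrls() {
        return resultVisitedUrls;
    }

    // HtmlParser is the same interface used in the question's code.
    public void crawl(String startUrl, HtmlParser htmlParser) throws Exception {
        // add() returns true only for the first thread to insert this URL;
        // every other thread sees false and skips the page.
        if (seenURLs.add(startUrl)) {
            List<String> subUrls = htmlParser.getUrls(startUrl);
            resultVisitedUrls.add(startUrl + "  Done by thread - " + Thread.currentThread());
            for (String subUrl : subUrls) {
                crawl(subUrl, htmlParser);
            }
        }
    }
}

Because the set is static and shared, it no longer matters that main creates two separate WebCrawler instances; all three submitted tasks consult the same set of visited URLs.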

If you wish to keep using a lock, you could instead change addUrlToSeenURLSet to return seenURL.add(url); and then use that method with the test if (addUrlToSeenURLSet(startUrl)) before running the crawl.
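
For completeness, here is a sketch of that lock-based variant, reusing the seenURL set and the writeLock_http_request field from the question:

// Test-and-add under the write lock: the membership check and the insertion
// happen atomically, so only one thread can "win" a given URL.
public boolean addUrlToSeenURLSet(String url) {
    writeLock_http_request.lock();
    try {
        return seenURL.add(url);   // true only if the URL was not seen before
    } finally {
        writeLock_http_request.unlock();
    }
}

In crawl, the separate contains(startUrl) check followed by addUrlToSeenURLSet(startUrl) then collapses into a single if (crawler.addUrlToSeenURLSet(startUrl)) { ... } block, and the read lock is no longer needed on this path.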
