Implementing a multi-threaded web crawler with Java's ReadWriteLocks

I am trying to implement a multi-threaded web crawler using ReadWriteLocks. I have a Callable calling an API to get page URLs and crawl them when they are not present in the Seen URLs Set.

From the ExecutorService I am using three threads to do the crawl.

The problem is that different threads are reading the same URL twice. How can I prevent different threads from reading a URL that has already been visited?

package Threads;

import java.util.*;
import java.util.List;
import java.util.concurrent.*;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class WebCrawler {

    static HashSet<String> seenURL = new HashSet<>();
    List<String> resultVisitedUrls = new ArrayList<>();
    ReadWriteLock lock_http_request = new ReentrantReadWriteLock();
    Lock readLock_http_request = lock_http_request.readLock();
    Lock writeLock_http_request = lock_http_request.writeLock();

    public boolean contains(String url){
        readLock_http_request.lock();
        try {
            return seenURL.contains(url);
        } finally {
            readLock_http_request.unlock();
        }
    }
    public void addUrlToSeenURLSet(String url){
        writeLock_http_request.lock();
        try {
            seenURL.add(url);
        } finally {
            writeLock_http_request.unlock();
        }
    }

    public List<String> getResultVisitedUrls() {
        return resultVisitedUrls;
    }


    public void crawl(String startUrl, HtmlParser htmlParser, WebCrawler crawler) throws Exception {
        if (!crawler.contains(startUrl)) {
            try {
                crawler.addUrlToSeenURLSet(startUrl);
                List<String> subUrls = htmlParser.getUrls(startUrl);

                resultVisitedUrls.add(startUrl + "  Done by thread - " + Thread.currentThread());

                for (String subUrl : subUrls) {
                    crawl(subUrl, htmlParser, crawler); 
                }
            } catch (Exception ex) {
                throw new Exception("Something went wrong. Method - crawl : " + ex.getMessage());
            }
        }

    }

    public static void main(String[] args) {
        class Crawl implements Callable<List<String>> {
            String startUrl;
            WebCrawler webCrawler;

            public Crawl(String startUrl, WebCrawler webCrawler){
                this.startUrl = startUrl;
                this.webCrawler = webCrawler;
            }

            public List<String> call() {
                HtmlParser htmlParser = new RetrieveURLs();
                List<String> result = new ArrayList<>();
                try {
                    webCrawler.crawl(startUrl, htmlParser, webCrawler);
                    result =  webCrawler.getResultVisitedUrls();
                }catch(Exception ex){
                    System.err.println("Some exception occurred in run() - " + ex.getMessage());
                }
                return result;
            }
        }

        ExecutorService service = Executors.newFixedThreadPool(4);
        try{
            WebCrawler webCrawler = new WebCrawler();
            WebCrawler webCrawler1 = new WebCrawler();

            Future<List<String>> future_1 = service.submit(new Crawl("http://localhost:3001/getUrls/google.com", webCrawler));
            Future<List<String>> future_2 = service.submit(new Crawl("http://localhost:3001/getUrls/google.com", webCrawler1));
            Future<List<String>> future_3 = service.submit(new Crawl("http://localhost:3001/getUrls/google.com", webCrawler1));

            List<String> result_1 = future_1.get();
            List<String> result_2 = future_2.get();
            List<String> result_3 = future_3.get();

            result_1.addAll(result_2);
            result_2.addAll(result_3);
            //Assert.assertEquals(6, result_1.size());
            System.out.println(result_1.size());
            for(String str : result_1){
                System.out.println(str );
            }

        }catch(ExecutionException | InterruptedException ex){

        }finally {
            service.shutdown();
        }

    }
}

Your fault lies with the fact that two threads can call contains(url) with the same value and both get false, so they both then enter the code block and call crawler.addUrlToSeenURLSet(startUrl).
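The unlucky interleaving, sketched as comments (thread names are illustrative):

// Thread A: contains("site.com")           -> false  (read lock acquired, then released)
// Thread B: contains("site.com")           -> false  (read locks are shared, so B is not blocked by A)
// Thread A: addUrlToSeenURLSet("site.com") -> crawls the URL
// Thread B: addUrlToSeenURLSet("site.com") -> crawls the same URL again

The read lock only makes each individual call consistent; it does not make the check-then-add sequence atomic.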

Rather than using a pair of locks, just use a concurrent set, which is backed by ConcurrentHashMap and is thread-safe:

private static final Set<String> seenURLs = ConcurrentHashMap.newKeySet(); 

When you use this set you only need to call add, as it returns true on the first call and false if the set already contains the same startUrl value that another thread is working on:

if(seenURLs.add(startUrl)) {
   ...
}
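A minimal sketch of crawl rewritten around that idea (assuming the HtmlParser interface from the question; the locks and the contains method are no longer needed):

private static final Set<String> seenURLs = ConcurrentHashMap.newKeySet();

public void crawl(String startUrl, HtmlParser htmlParser) throws Exception {
    // add is atomic: exactly one thread "wins" each URL, every other thread skips it
    if (seenURLs.add(startUrl)) {
        List<String> subUrls = htmlParser.getUrls(startUrl);
        resultVisitedUrls.add(startUrl + "  Done by thread - " + Thread.currentThread());
        for (String subUrl : subUrls) {
            crawl(subUrl, htmlParser);
        }
    }
}

Note that if several Crawl tasks share one WebCrawler instance, resultVisitedUrls has the same problem, so it should also be a thread-safe collection, for example Collections.synchronizedList(new ArrayList<>()).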

If you wish to use a lock, you could also change addUrlToSeenURLSet to return seenURL.add(url); and then use that method with the test if(addUrlToSeenURLSet(startUrl)) before running the crawl.
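A sketch of that variant, keeping the write lock and field names from the question:

public boolean addUrlToSeenURLSet(String url){
    writeLock_http_request.lock();
    try {
        // Set.add already reports whether the element was newly inserted
        return seenURL.add(url);
    } finally {
        writeLock_http_request.unlock();
    }
}

// in crawl(), replace the separate contains()/add pair with one atomic step:
if (addUrlToSeenURLSet(startUrl)) {
    ...
}

Either way, the key point is the same: the membership check and the insertion must happen as a single atomic operation.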
