Multithreaded Web Crawler in Java

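I'm trying to write a multithreaded web crawler in Java using Jsoup. I have a Java class "Master" which creates 6 threads (5 for crawling and 1 for queue maintenance) and 3 queues: "to_do", "to_do_next" (to be processed in the next iteration) and "done" (the final links). I'm using synchronized locks on the shared queues. The idea is that as soon as all 5 threads find the "to_do" queue empty, they notify the maintenance thread, which does some work and then notifies the worker threads back. The problem is that the program sometimes blocks, so I assume there is a race condition I haven't handled. While debugging I also found that not all threads are notified by the maintenance thread. Is it possible that some notify signals are being lost?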

Code for the Master class

private Queue<String> to_do = new LinkedList<String>();
private Queue<String> done= new LinkedList<String>(); 
private Queue<String> to_do_next = new LinkedList<String>();
private int level = 1;
private Object lock1 = new Object();
private Object lock2 = new Object();
private Object lock3 = new Object();
private static Thread maintenance;

public static Master mref;
public static Object wait1 = new Object();
public static Object wait2 = new Object();
public static Object wait3 = new Object();
public static int flag = 5;
public static int missedSignals = -1;

public boolean checkToDoEmpty(){
    return to_do.isEmpty();
}

public int getLevel() {
    return level;
}

public void incLevel() {
    this.level++;
}

public static void interrupt() {
     maintenance.interrupt();
}

public void transfer() {
    to_do = to_do_next;
}

public String accessToDo() {
    synchronized(lock1){
        String tmp = to_do.peek();
        if(tmp != null)
            tmp = to_do.remove();
        return tmp;
    }
}

public void addToDoNext(String url){
    synchronized(lock2){
        to_do_next.add(url);
    }
}

public void addDone(String string) {
    synchronized(lock3){
        done.add(string);
    }

}

public static void main(String[] args){

    Master m = new Master();
    mref = m;
    URL startUrl = null;
    try {
        startUrl = new URL("http://cse.iitkgp.ac.in");
    }catch (MalformedURLException e1) {
        e1.printStackTrace();
    }

    Thread t1 = new Thread(new Worker(1));
    Thread t2 = new Thread(new Worker(2));
    Thread t3 = new Thread(new Worker(3));
    Thread t4 = new Thread(new Worker(4));
    Thread t5 = new Thread(new Worker(5));
    maintenance = new Thread(new MaintenanceThread());

    m.to_do.add(startUrl.toString());

    maintenance.start();
    t1.start();
    t2.start();
    t3.start();
    t4.start();
    t5.start();

    try {
        t1.join();
        t2.join();
        t3.join();
        t4.join();
        t5.join();
    } catch (InterruptedException e) {
        e.printStackTrace();
    }
    /*for(String s:m.done)
        System.out.println(s);
    for(String s:m.to_do)
        System.out.println(s);*/
}

Code for the worker thread

public void run() {

    while(Master.mref.getLevel() != 3){

        if(!Master.mref.checkToDoEmpty()){

            String url = Master.mref.accessToDo();

            if(url != null && url.contains("iitkgp") && url.contains("http://")){

                try {
                    Document doc = Jsoup.connect(url).get();
                    org.jsoup.select.Elements links = doc.select("a[href]");

                    for(org.jsoup.nodes.Element l: links){
                        Master.mref.addToDoNext(l.attr("abs:href").toString());
                    }

                    Master.mref.addDone(url);
                } catch (IOException e) {

                    System.out.println(url);
                    e.printStackTrace();
                }
                continue;
            }   
        }
        //System.out.println("thread " + id + " about to notify on wait1");
        synchronized(Master.wait1){
            Master.wait1.notify();
            Master.missedSignals++;
        }
        synchronized(Master.wait2){
            try {
                Master.wait2.wait();
                System.out.println("thread " + id + " coming out of wait2");
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }

    }
    System.out.println("Terminating " + id + " thread");
    Master.flag--;
    if(Master.flag == 0)
        Master.interrupt();
}

Code for the maintenance thread

while(Master.flag != 0){

        try {
            synchronized(Master.wait1){
                if(Master.missedSignals != -1){
                    count += Master.missedSignals;
                    Master.missedSignals = -1;
                }
                while(count != 5){
                    Master.wait1.wait();
                    if(Master.missedSignals != -1)
                        count += Master.missedSignals;
                    Master.missedSignals = -1;
                    count++;
                }
                count = 0;
            }
            //System.out.println("in between");
            Master.mref.incLevel();
            Master.mref.transfer();
            synchronized(Master.wait2){
                Master.wait2.notifyAll();
            }

        } catch (InterruptedException e) {

            break;
        }
    }
    System.out.println("Mainta thread gone");

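Your design is too complicated.

I suggest using a LinkedBlockingQueue for your to_do queue.

It is a blocking queue: a thread that requests an object from the queue stays blocked until an object becomes available, and only then receives it.

Simply use put() and take() to add objects to the queue and remove them from it.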
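As a minimal sketch of the put()/take() hand-off, assuming a single consumer thread and a hypothetical helper named transfer (neither is part of the original crawler), the following shows that take() blocks until an element has been put():

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class BlockingQueueDemo {

    // Hypothetical helper: a consumer thread take()s every URL the caller put()s.
    static List<String> transfer(List<String> urls) {
        BlockingQueue<String> toDo = new LinkedBlockingQueue<>();
        List<String> received = new ArrayList<>();

        Thread consumer = new Thread(() -> {
            try {
                for (int i = 0; i < urls.size(); i++) {
                    // take() blocks until an element is available.
                    received.add(toDo.take());
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        consumer.start();

        try {
            for (String url : urls) {
                // put() does not block here because the queue is unbounded.
                toDo.put(url);
            }
            // join() waits for the consumer and makes its writes visible.
            consumer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return received;
    }

    public static void main(String[] args) {
        System.out.println(transfer(Arrays.asList(
                "http://example.com/a", "http://example.com/b")));
    }
}
```

No explicit lock, wait() or notify() is needed: the queue does the synchronization internally, which is what removes the opportunity for a lost signal.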

See the following two links for more details on this queue: http://docs.oracle.com/javase/7/docs/api/java/util/concurrent/LinkedBlockingQueue.html

http://tutorials.jenkov.com/java-util-concurrent/linkedblockingqueue.html

Now the only thing left to take care of is ending the threads once they have finished their work. For that I suggest the following:

boolean someThreadStillAlive = true;
while (someThreadStillAlive) {
  someThreadStillAlive = false;
  Thread.sleep(200);
  for (Thread t : fetchAndParseThreads) {
    someThreadStillAlive = someThreadStillAlive || t.isAlive();
  }
}

This goes in your main code block, where it loops and sleeps until all threads have finished.
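If polling isAlive() every 200 ms is not required, Thread.join() is an alternative way to wait for the same threads. This is a different technique from the loop above, sketched here under the assumption that the workers are plain Thread objects; the class and method names are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

class WaitForWorkers {

    // Hypothetical helper: starts n threads running task, waits for all of
    // them, and returns how many ran to completion.
    static int runAndWait(int n, Runnable task) {
        List<Thread> fetchAndParseThreads = new ArrayList<>();
        AtomicInteger finished = new AtomicInteger();

        for (int i = 0; i < n; i++) {
            Thread t = new Thread(() -> {
                task.run();
                finished.incrementAndGet();
            });
            fetchAndParseThreads.add(t);
            t.start();
        }

        for (Thread t : fetchAndParseThreads) {
            try {
                // join() blocks until this thread has terminated.
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return finished.get();
    }

    public static void main(String[] args) {
        System.out.println(runAndWait(5, () -> { }));
    }
}
```

join() returns as soon as each thread ends, so there is no sleep interval to tune.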

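Also, instead of take() you can use poll(long timeout, TimeUnit unit). It waits up to the given timeout, and if no new object has been inserted into the queue by then it returns null, at which point the thread can exit.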
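A worker loop built on the timed poll() might look like the following sketch. The names drain, workerCount and idleTimeoutMillis are assumptions for illustration, and the actual page fetch is left out:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

class PollingWorkers {

    // Hypothetical driver: workers consume URLs until the queue has been
    // idle for idleTimeoutMillis; returns how many URLs were processed.
    static int drain(List<String> seeds, int workerCount, long idleTimeoutMillis) {
        BlockingQueue<String> toDo = new LinkedBlockingQueue<>(seeds);
        Queue<String> done = new ConcurrentLinkedQueue<>();

        Runnable worker = () -> {
            try {
                String url;
                // poll() returns null when nothing arrives within the timeout.
                while ((url = toDo.poll(idleTimeoutMillis, TimeUnit.MILLISECONDS)) != null) {
                    // A real crawler would fetch the page here and put()
                    // the links it finds back onto toDo.
                    done.add(url);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };

        Thread[] threads = new Thread[workerCount];
        for (int i = 0; i < workerCount; i++) {
            threads[i] = new Thread(worker);
            threads[i].start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return done.size();
    }

    public static void main(String[] args) {
        List<String> seeds = Arrays.asList("http://example.com/a", "http://example.com/b");
        System.out.println(drain(seeds, 2, 200));
    }
}
```

The timeout is a heuristic: if a single fetch takes longer than the timeout, idle workers may exit while new links are still on their way, so it should be chosen comfortably above the slowest expected request.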

All of the above has been used successfully in my own crawler.

