Multithreaded Web Crawler in Java
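I am trying to write a multithreaded web crawler in Java using Jsoup. I have a Java class "Master" which creates 6 threads (5 for crawling and 1 for maintaining the queues) and 3 queues, namely "to_do", "to_do_next" (to be done in the next iteration) and "done" (final links). I am using synchronized locks on the shared queues. The idea is that as soon as all 5 threads find the "to_do" queue empty, they notify the maintenance thread, which does some work and then notifies these threads back. The problem is that the program sometimes blocks (so I assume there is some race condition I have not taken care of). While checking, I also found that not all threads get notified by the maintenance thread, so is it possible that some notify signals are being lost?
Master class code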
private Queue<String> to_do = new LinkedList<String>();
private Queue<String> done = new LinkedList<String>();
private Queue<String> to_do_next = new LinkedList<String>();
private int level = 1;
private Object lock1 = new Object();
private Object lock2 = new Object();
private Object lock3 = new Object();
private static Thread maintenance;
public static Master mref;
public static Object wait1 = new Object();
public static Object wait2 = new Object();
public static Object wait3 = new Object();
public static int flag = 5;
public static int missedSignals = -1;

public boolean checkToDoEmpty(){
    return to_do.isEmpty();
}

public int getLevel() {
    return level;
}

public void incLevel() {
    this.level++;
}

public static void interrupt() {
    maintenance.interrupt();
}

public void transfer() {
    to_do = to_do_next;
}

public String accessToDo() {
    synchronized(lock1){
        String tmp = to_do.peek();
        if(tmp != null)
            tmp = to_do.remove();
        return tmp;
    }
}

public void addToDoNext(String url){
    synchronized(lock2){
        to_do_next.add(url);
    }
}

public void addDone(String string) {
    synchronized(lock3){
        done.add(string);
    }
}

public static void main(String[] args){
    Master m = new Master();
    mref = m;
    URL startUrl = null;
    try {
        startUrl = new URL("http://cse.iitkgp.ac.in");
    }catch (MalformedURLException e1) {
        e1.printStackTrace();
    }
    Thread t1 = new Thread(new Worker(1));
    Thread t2 = new Thread(new Worker(2));
    Thread t3 = new Thread(new Worker(3));
    Thread t4 = new Thread(new Worker(4));
    Thread t5 = new Thread(new Worker(5));
    maintenance = new Thread(new MaintenanceThread());
    m.to_do.add(startUrl.toString());
    maintenance.start();
    t1.start();
    t2.start();
    t3.start();
    t4.start();
    t5.start();
    try {
        t1.join();
        t2.join();
        t3.join();
        t4.join();
        t5.join();
    } catch (InterruptedException e) {
        e.printStackTrace();
    }
    /*for(String s:m.done)
        System.out.println(s);
    for(String s:m.to_do)
        System.out.println(s);*/
}
Worker thread code
public void run() {
    while(Master.mref.getLevel() != 3){
        if(!Master.mref.checkToDoEmpty()){
            String url = Master.mref.accessToDo();
            if(url != null && url.contains("iitkgp") && url.contains("http://")){
                try {
                    Document doc = Jsoup.connect(url).get();
                    org.jsoup.select.Elements links = doc.select("a[href]");
                    for(org.jsoup.nodes.Element l: links){
                        Master.mref.addToDoNext(l.attr("abs:href").toString());
                    }
                    Master.mref.addDone(url);
                } catch (IOException e) {
                    System.out.println(url);
                    e.printStackTrace();
                }
                continue;
            }
        }
        //System.out.println("thread " + id + " about to notify on wait1");
        synchronized(Master.wait1){
            Master.wait1.notify();
            Master.missedSignals++;
        }
        synchronized(Master.wait2){
            try {
                Master.wait2.wait();
                System.out.println("thread " + id + " coming out of wait2");
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }
    }
    System.out.println("Terminating " + id + " thread");
    Master.flag--;
    if(Master.flag == 0)
        Master.interrupt();
}
Maintenance thread code
while(Master.flag != 0){
    try {
        synchronized(Master.wait1){
            if(Master.missedSignals != -1){
                count += Master.missedSignals;
                Master.missedSignals = -1;
            }
            while(count != 5){
                Master.wait1.wait();
                if(Master.missedSignals != -1)
                    count += Master.missedSignals;
                Master.missedSignals = -1;
                count++;
            }
            count = 0;
        }
        //System.out.println("in between");
        Master.mref.incLevel();
        Master.mref.transfer();
        synchronized(Master.wait2){
            Master.wait2.notifyAll();
        }
    } catch (InterruptedException e) {
        break;
    }
}
System.out.println("Mainta thread gone");
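Your design is too complicated.
I suggest using the following for your to_do queue: LinkedBlockingQueue
This is a blocking queue, which means that your threads ask the queue for an object and stay blocked until one becomes available; only then do they get it.
Just use the following methods to put objects into the queue and take them out: put() and take()
Please look at the following two links for more explanation of this special queue: http://docs.oracle.com/javase/7/docs/api/java/util/concurrent/LinkedBlockingQueue.html
http://tutorials.jenkov.com/java-util-concurrent/linkedblockingqueue.html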
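To make the put()/take() behaviour concrete, here is a minimal sketch. It is not the asker's crawler: the class name PutTakeSketch, the method transfer and the example URLs are made up for illustration, and the consumer simply takes a known number of items instead of fetching pages.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical demo class; none of these names come from the original post.
class PutTakeSketch {

    // The calling thread put()s every URL; a consumer thread take()s exactly
    // urls.size() items, blocking whenever the queue is momentarily empty.
    static List<String> transfer(List<String> urls) {
        BlockingQueue<String> toDo = new LinkedBlockingQueue<>();
        // Written only by the consumer thread and read only after join().
        List<String> received = new ArrayList<>();

        Thread consumer = new Thread(() -> {
            try {
                for (int i = 0; i < urls.size(); i++) {
                    received.add(toDo.take()); // blocks until an element exists
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // restore the flag and exit
            }
        });
        consumer.start();

        try {
            for (String url : urls) {
                toDo.put(url); // never blocks here: the queue is unbounded
            }
            consumer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while transferring", e);
        }
        return received;
    }

    public static void main(String[] args) {
        List<String> out = transfer(Arrays.asList("http://a.example", "http://b.example"));
        System.out.println(out);
    }
}
```

No explicit lock, wait() or notify() appears anywhere: the queue does the signalling internally, so there is no notification for the application code to lose.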
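Now the only thing you need to think about is killing the threads once they have finished their work. For that I suggest the following: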
boolean someThreadStillAlive = true;
while (someThreadStillAlive) {
    someThreadStillAlive = false;
    Thread.sleep(200);
    for (Thread t : fetchAndParseThreads) {
        someThreadStillAlive = someThreadStillAlive || t.isAlive();
    }
}
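This goes in your main code block, where it loops and sleeps until all threads have finished.
Oh, and instead of take() you can use poll(long timeout, TimeUnit unit), which waits up to the timeout and returns null if no new object was inserted into the queue in that time; the thread can then exit on its own.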
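Putting the two suggestions together might look roughly like the sketch below. The class name PollTimeoutSketch, the method drain and its parameters are assumptions made for illustration, and the "fetch" step is a placeholder that only records the URL. Pick the idle timeout generously in a real crawler: if it is shorter than a slow page fetch, idle workers may exit while another worker is still about to enqueue new links.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class; only fetchAndParseThreads and someThreadStillAlive
// are names taken from the answer.
class PollTimeoutSketch {

    // Starts workerCount threads that poll the queue and exit once it has
    // been empty for idleMillis. Returns how many URLs were processed.
    static int drain(List<String> urls, int workerCount, long idleMillis) {
        BlockingQueue<String> toDo = new LinkedBlockingQueue<>(urls);
        Queue<String> done = new ConcurrentLinkedQueue<>();
        List<Thread> fetchAndParseThreads = new ArrayList<>();

        for (int i = 0; i < workerCount; i++) {
            Thread t = new Thread(() -> {
                try {
                    String url;
                    // poll() returns null when the timeout elapses with no element
                    while ((url = toDo.poll(idleMillis, TimeUnit.MILLISECONDS)) != null) {
                        done.add(url); // placeholder for fetching and parsing
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            fetchAndParseThreads.add(t);
            t.start();
        }

        // The wait loop from the answer: sleep, then check whether any worker is alive.
        boolean someThreadStillAlive = true;
        try {
            while (someThreadStillAlive) {
                someThreadStillAlive = false;
                Thread.sleep(50);
                for (Thread t : fetchAndParseThreads) {
                    someThreadStillAlive = someThreadStillAlive || t.isAlive();
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return done.size();
    }

    public static void main(String[] args) {
        int processed = drain(Arrays.asList("http://a.example", "http://b.example", "http://c.example"), 2, 100L);
        System.out.println(processed + " URLs processed");
    }
}
```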
All of the above has been used successfully in my own crawler.