Multithreaded DFS for web crawler in Java
I am writing a web crawler in Java using Jsoup.
Currently I have a single-threaded implementation that uses depth-first search (it only has to crawl one domain, so I could have chosen either DFS or BFS, and opted for DFS as it meant I could use a queue instead of a stack, and therefore a LinkedBlockingQueue when I do a multithreaded version).
I have a Queue of links to visit and a HashSet of already-visited links, and my main loop pops a link from the queue, visits the page, and adds any unvisited links from the page to the queue.
Here are the contents of the class implementing my single-threaded version (if any of the throws declarations are spurious, please let me know why, as I need to get to grips with that):
import java.io.IOException;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.List;
import java.util.concurrent.LinkedBlockingQueue;

import org.jsoup.HttpStatusException;
import org.jsoup.Jsoup;
import org.jsoup.helper.Validate;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Crawler {

    private static LinkedBlockingQueue<String> URLSToCrawl = new LinkedBlockingQueue<String>();
    private static String baseURL;
    private static String HTTPSBaseURL;
    private static HashSet<String> alreadyCrawledSet = new HashSet<String>();
    private static List<String> deadLinks = new LinkedList<String>();

    public static void main(String[] args) throws IOException, InterruptedException {
        // should output a site map, showing the static assets for each page.
        Validate.isTrue(args.length == 1, "usage: supply url to fetch");
        baseURL = args[0];
        HTTPSBaseURL = baseURL.replace("http://", "https://");

        alreadyCrawledSet.add(baseURL);
        URLSToCrawl.add(baseURL);

        while (!URLSToCrawl.isEmpty()) {
            String url = URLSToCrawl.take();
            crawlURL(url);
        }
    }

    private static void crawlURL(String url) throws IOException, InterruptedException {
        print("%s", url);
        try {
            Document doc = Jsoup.connect(url).get();
            Elements links = doc.select("a[href]");
            for (Element link : links) {
                String linkURL = link.attr("abs:href");
                if (sameDomain(linkURL) && !alreadyCrawled(linkURL)) {
                    alreadyCrawledSet.add(linkURL);
                    URLSToCrawl.put(linkURL);
                }
            }
        } catch (HttpStatusException e) {
            // a 4xx/5xx response: record the URL rather than aborting the crawl
            deadLinks.add(url);
        }
    }

    private static boolean alreadyCrawled(String url) {
        return alreadyCrawledSet.contains(url);
    }

    // Helpers referenced above but not shown in the original post;
    // plausible reconstructions, not necessarily the author's code:
    private static boolean sameDomain(String url) {
        return url.startsWith(baseURL) || url.startsWith(HTTPSBaseURL);
    }

    private static void print(String msg, Object... args) {
        System.out.println(String.format(msg, args));
    }
}
I would like to make this multithreaded, to take advantage of the fact that the single-threaded implementation has to wait for the HTTP request in the Jsoup.connect(url).get() call to return before continuing to process. I am hoping that by allowing multiple threads to act at once, some work will get done during this I/O-bound delay, speeding up the program.
I am not very experienced with concurrency. My first thought was to simply create an Executor and just submit every call to crawlURL to it. But I am confused: I don't know how to make sure my HashSet and Queue are accessed in a thread-safe manner, especially given that each thread not only consumes URLs from the Queue but also pushes new URLs onto it.
I understand the basic concepts of atomicity, and the idea that threads can "lock" shared resources, but I don't know how to put them into practice in this scenario.
Does anyone have any advice for making this multithreaded?
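On the thread-safety worry specifically, here is a minimal sketch of how the two shared structures could be swapped for concurrent equivalents (ThreadSafeFrontier, offer and next are placeholder names, not code from the question). The key detail is that a contains() check followed by add() is a check-then-act race even on a concurrent set, because two threads can both pass the check for the same URL; Set.add() returns false when the element was already present, which collapses the check and the insert into one atomic step.

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

public class ThreadSafeFrontier {
    // ConcurrentHashMap.newKeySet() yields a thread-safe Set<String>;
    // LinkedBlockingQueue is already safe for concurrent put()/take().
    private final Set<String> alreadyCrawled = ConcurrentHashMap.newKeySet();
    private final LinkedBlockingQueue<String> urlsToCrawl = new LinkedBlockingQueue<>();

    // add() returns true only for the first thread to insert a given URL,
    // so the membership check and the insert happen as one atomic step.
    public void offer(String url) throws InterruptedException {
        if (alreadyCrawled.add(url)) {
            urlsToCrawl.put(url);
        }
    }

    public String next() throws InterruptedException {
        return urlsToCrawl.take(); // blocks until a URL is available
    }
}

With something like this in place, any number of worker threads can call offer and next without external locking.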
My solution was to deal with one layer of the graph at a time. So for each level, I submit each link to the ExecutorService to be crawled (multithreaded), but then wait for that level to be completed (using a CountDownLatch) before moving on to the next level (see the sketch at the end of this answer).
I used a FixedThreadPool as a form of rate limiting.
(Initially I tried to just dispatch every URL asynchronously, which ought to be more efficient, but I couldn't figure out how to terminate the whole thing.)
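For reference, a minimal sketch of this level-by-level scheme, under some assumptions: LevelCrawler, crawl and fetchLinks are placeholder names, the pool size of 8 is arbitrary, and fetchLinks stands in for the Jsoup.connect(url).get() and doc.select("a[href]") logic from the question. Each pass submits one task per URL in the current level to a fixed pool, a CountDownLatch blocks until the whole level is done, and the loop (and hence the program) terminates cleanly once a level discovers no new links, which is exactly the termination problem the fully asynchronous version ran into.

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class LevelCrawler {
    // The fixed-size pool doubles as a crude rate limiter:
    // at most 8 requests are in flight at any moment.
    private final ExecutorService pool = Executors.newFixedThreadPool(8);
    private final Set<String> visited = ConcurrentHashMap.newKeySet();

    public void crawl(String startURL) throws InterruptedException {
        Set<String> currentLevel = ConcurrentHashMap.newKeySet();
        visited.add(startURL);
        currentLevel.add(startURL);

        while (!currentLevel.isEmpty()) {
            Set<String> nextLevel = ConcurrentHashMap.newKeySet();
            CountDownLatch latch = new CountDownLatch(currentLevel.size());

            for (String url : currentLevel) {
                pool.submit(() -> {
                    try {
                        for (String link : fetchLinks(url)) {
                            if (visited.add(link)) { // atomic check-and-add
                                nextLevel.add(link);
                            }
                        }
                    } finally {
                        latch.countDown(); // count down even if the fetch fails
                    }
                });
            }

            latch.await();            // wait for the whole level to finish
            currentLevel = nextLevel; // an empty next level ends the crawl
        }
        pool.shutdown(); // no more levels, so the pool can wind down
    }

    // Placeholder for the Jsoup fetch-and-extract-links logic.
    private Set<String> fetchLinks(String url) {
        return Set.of(); // stub
    }
}

The per-level barrier gives up a little parallelism at level boundaries, but in exchange the termination condition is trivially correct.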