
Asynchronous Web Requests in Java?

I am writing a simple web crawler in Java, and I want it to download as many pages per second as possible. Is there a package out there that makes doing asynchronous HTTP web requests easy in Java? I have used HttpURLConnection, but that is blocking. I also know there is something in Apache's HttpCore NIO, but I am looking for something more lightweight. I tried that package, and I was getting better throughput using HttpURLConnection on multiple threads.
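For context, here is a minimal sketch of the blocking HttpURLConnection pattern the question describes (the timeout values are placeholders): the calling thread parks on the connect and on every read until the server responds.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class BlockingFetch {
        // Downloads one page; the calling thread blocks on connect and on every read.
        static String fetch(String pageUrl) throws IOException {
            HttpURLConnection conn = (HttpURLConnection) new URL(pageUrl).openConnection();
            conn.setConnectTimeout(5000); // placeholder timeout: fail fast on dead hosts
            conn.setReadTimeout(5000);
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream()));
            try {
                StringBuilder body = new StringBuilder();
                String line;
                while ((line = in.readLine()) != null) { // blocks until data arrives
                    body.append(line).append('\n');
                }
                return body.toString();
            } finally {
                in.close();
                conn.disconnect();
            }
        }
    }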

Generally, data-intensive protocols tend to perform better in terms of raw throughput with classic blocking I/O than with NIO, as long as the number of threads is below 1000. At least that is certainly the case for client-side HTTP, based on the (likely imperfect and possibly biased) HTTP benchmark used by Apache HttpClient [1].

You may be much better off using a blocking HTTP client with threads, as long as the number of threads is moderate (<250), as sketched below.
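A minimal sketch of that approach, reusing the hypothetical BlockingFetch.fetch helper from the snippet above and placeholder URLs: each worker makes plain blocking calls, and the fixed pool keeps the thread count bounded.

    import java.util.Arrays;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class ThreadedCrawler {
        public static void main(String[] args) {
            // A bounded pool, well under the ~250-thread ceiling suggested above.
            ExecutorService pool = Executors.newFixedThreadPool(100);
            List<String> urls = Arrays.asList("http://example.com/a", "http://example.com/b");
            for (final String url : urls) {
                pool.submit(new Runnable() {
                    public void run() {
                        try {
                            // Blocking call, but it only ties up this worker thread.
                            String body = BlockingFetch.fetch(url);
                            System.out.println(url + " -> " + body.length() + " chars");
                        } catch (Exception e) {
                            System.err.println(url + " failed: " + e.getMessage());
                        }
                    }
                });
            }
            pool.shutdown(); // stop accepting new work; in-flight downloads finish
        }
    }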

If you are absolutely sure you want an NIO-based HTTP client, I can recommend the Jetty HTTP client, which I personally consider the best asynchronous HTTP client at the moment.
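A minimal sketch of an asynchronous GET with the Jetty client. Note this assumes the Jetty 9.x HttpClient API (the listener-based send), which is newer than what was current when this answer was written; the URL and the sleep are placeholders for a real completion strategy.

    import org.eclipse.jetty.client.HttpClient;
    import org.eclipse.jetty.client.api.Result;
    import org.eclipse.jetty.client.util.BufferingResponseListener;

    public class JettyAsyncFetch {
        public static void main(String[] args) throws Exception {
            HttpClient client = new HttpClient();
            client.start(); // starts the client's selector/worker threads

            // send(listener) returns immediately; the callback fires on completion
            client.newRequest("http://example.com/")
                  .send(new BufferingResponseListener() {
                      @Override
                      public void onComplete(Result result) {
                          if (result.isSucceeded()) {
                              System.out.println("got " + getContentAsString().length() + " chars");
                          } else {
                              result.getFailure().printStackTrace();
                          }
                      }
                  });

            Thread.sleep(5000); // crude wait for the async response in this demo
            client.stop();
        }
    }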

[1] http://wiki.apache.org/HttpComponents/HttpClient3vsHttpClient4vsHttpCore

While this user wasn't asking the same question, you may find the answers to his question useful: Asynchronous HTTP Client for Java

As a side note, if you're going to download "as many pages per second as possible", you should bear in mind that crawlers can inadvertently grind a weak server to a halt. You should probably read up on robots.txt and the appropriate way of interpreting that file before you unleash your creation on anything outside your own personal test setup.
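As a deliberately naive illustration (the class and method names here are made up, and real robots.txt handling also covers Allow lines, wildcards, crawl delays, and per-agent sections), a polite crawler might at minimum honor the Disallow prefixes in the "User-agent: *" section:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.util.ArrayList;
    import java.util.List;

    public class NaiveRobots {
        // Collects Disallow prefixes from the "User-agent: *" section only.
        static List<String> disallowedPrefixes(String host) throws Exception {
            List<String> prefixes = new ArrayList<String>();
            URL robots = new URL("http://" + host + "/robots.txt");
            BufferedReader in = new BufferedReader(new InputStreamReader(robots.openStream()));
            boolean inStarSection = false;
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (line.toLowerCase().startsWith("user-agent:")) {
                    inStarSection = line.substring(11).trim().equals("*");
                } else if (inStarSection && line.toLowerCase().startsWith("disallow:")) {
                    String path = line.substring(9).trim();
                    if (!path.isEmpty()) {
                        prefixes.add(path);
                    }
                }
            }
            in.close();
            return prefixes;
        }

        // Returns false if the path falls under any disallowed prefix.
        static boolean allowed(String path, List<String> disallowed) {
            for (String prefix : disallowed) {
                if (path.startsWith(prefix)) {
                    return false;
                }
            }
            return true;
        }
    }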
