

What is the fastest way to go through a file of URLs and sum up their size?

I have a file with approximately 200,000 document URLs. I want to sum up the sizes of these URLs. I've written something in Java using HttpURLConnection, but it takes a very long time to run, and that is of course understandable: it opens an HTTP connection for each one.
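For reference, a minimal sketch of the sequential approach described above (a reconstruction under assumptions, not the asker's actual code; the input file name urls.txt is hypothetical):

import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;

public class SequentialSizeSum {
    public static void main(String[] args) throws IOException {
        long total = 0;
        // hypothetical input file, one URL per line
        for (String line : Files.readAllLines(Paths.get("urls.txt"))) {
            try {
                HttpURLConnection conn =
                        (HttpURLConnection) new URL(line).openConnection();
                long len = conn.getContentLengthLong(); // reads the Content-Length header
                if (len > 0) total += len;
                conn.disconnect();
            } catch (IOException e) {
                // skip URLs that fail to resolve or connect
            }
        }
        System.out.println("Total bytes: " + total);
    }
}

Each iteration opens and tears down its own HTTP connection, which is exactly where the time goes.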

Is there a faster way to do this? Maybe the same thing in another language would take less time (if processing a single HTTP connection in Java takes a bit longer than in another language, then with my number of connections it would be noticeable)? Or another approach?

Changing the language won't make a difference here, because opening 200,000 HTTP connections, however you look at it, takes a long time!

You could use a thread pool and execute the tasks concurrently, which might speed things up quite a bit, but something like this is never going to run in a second or two.
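A minimal sketch of that thread-pool approach, assuming the URLs have been read into a list (the pool size of 50 and the 5-second timeouts are arbitrary choices, not tuned values):

import java.net.HttpURLConnection;
import java.net.URL;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class ConcurrentSizeSum {
    public static void main(String[] args) throws InterruptedException {
        // hypothetical sample; in practice, read the 200,000 URLs from the file
        List<String> urls = List.of(
                "https://example.com/a.pdf",
                "https://example.com/b.pdf");

        AtomicLong total = new AtomicLong();
        ExecutorService pool = Executors.newFixedThreadPool(50);

        for (String u : urls) {
            pool.submit(() -> {
                try {
                    HttpURLConnection conn =
                            (HttpURLConnection) new URL(u).openConnection();
                    conn.setConnectTimeout(5_000);
                    conn.setReadTimeout(5_000);
                    long len = conn.getContentLengthLong();
                    if (len > 0) total.addAndGet(len); // thread-safe accumulation
                    conn.disconnect();
                } catch (Exception e) {
                    // skip URLs that fail; a real run would log them
                }
            });
        }

        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        System.out.println("Total bytes: " + total.get());
    }
}

With 50 workers, wall-clock time drops roughly by the pool size, but 200,000 requests will still take minutes, not seconds.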

You should also use HEAD HTTP requests to retrieve only the Content-Length header, not the content, to speed up the process. Threads can help as well, especially when a single request does not load your line very much. The last and probably most efficient option you have is to execute the process physically near the servers, e.g. in the same subnet.
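A minimal sketch of such a HEAD request with HttpURLConnection (the URL in main is a placeholder; servers that omit Content-Length will report -1):

import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class HeadContentLength {
    // Returns the size advertised in Content-Length, or -1 if the header is missing.
    static long contentLength(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("HEAD"); // ask for headers only, no response body
        conn.setConnectTimeout(5_000);
        conn.setReadTimeout(5_000);
        long len = conn.getContentLengthLong();
        conn.disconnect();
        return len;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(contentLength("https://example.com/")); // placeholder URL
    }
}

Combined with the thread pool shown above, HEAD requests avoid transferring any document bodies at all.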

It seems like you are approaching the problem in the wrong way. Your bottleneck isn't in adding up the sizes, but in efficiently accessing the URLs to determine the size of each file. Luckily, there are web services that can help you overcome this bottleneck; maybe try a service like 80legs to run a cheap web crawl and then analyze the result set...

http://80legs.com/services.html

Also, just a point of clarification: you are hoping to learn the size of the files the URLs point to, not of the URL strings themselves, right?
