
Best way to parallelize thousands of downloads

I am creating an application in which I have to download thousands of images (~1 MB each) using Java.

I take a list of album URLs in my REST request; each album contains multiple images.

So my request looks something like:

[
  "www.abc.xyz/album1",
  "www.abc.xyz/album2",
  "www.abc.xyz/album3",
  "www.abc.xyz/album4",
  "www.abc.xyz/album5"
]

Suppose each of these albums has 1000 images, so I need to download 50000 images in parallel.

Right now I have implemented it using parallelStream(), but I feel that I can optimize it further.

There are two principal classes, AlbumDownloader and ImageDownloader (Spring components).

So the main application creates a parallelStream() on the list of albums:

albumData.parallelStream().forEach(ad -> albumDownloader.downloadAlbum(ad));

And there is a parallelStream() inside the AlbumDownloader -> downloadAlbum() method as well:

List<Boolean> downloadStatus = albumData.getImageDownloadData().parallelStream().map(idd -> imageDownloader.downloadImage(idd)).collect(Collectors.toList());

I am thinking about using CompletableFuture with ExecutorService, but I am not sure what pool size I should use.

Should I create a separate pool for each album?

ExecutorService executor = Executors.newFixedThreadPool(Math.min(albumData.getImageDownloadData().size(), 1000));

That would create 5 different pools of 1000 threads each, about 5000 threads in total, which might degrade performance instead of improving it.

Could you please give me some ideas to make it very very fast?

By the way, I am using Apache Commons IO FileUtils to download the files, and I have a machine with 12 available CPU cores.

> Suppose each of these albums has 1000 images, so I need to download 50000 images in parallel.

It's wrong to think of your application doing 50000 things in parallel. What you are trying to do is optimize your throughput: you are trying to download all of the images in the shortest amount of time.

You should try one fixed-size thread pool and then play around with the number of threads in the pool until you optimize your throughput; maybe start with double the number of processors. If your application is mostly waiting for the network or the server, then maybe you can increase the number of threads in the pool, but you wouldn't want to overload the server so that it slows to a crawl, and you wouldn't want to thrash your application with a huge number of threads.
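As a rough illustration of that advice, here is a minimal sketch of a single fixed-size pool whose size starts at twice the processor count. The class name FixedPoolDownloads, the helper initialPoolSize and the Predicate-based downloader are assumptions made for this example; the predicate stands in for something like imageDownloader.downloadImage(...) from the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Predicate;

public class FixedPoolDownloads {

    // Starting point only: twice the processor count. Tune it by measuring.
    public static int initialPoolSize(int processors) {
        return Math.max(1, processors * 2);
    }

    // Runs every download on ONE fixed-size pool and returns how many succeeded.
    // 'downloader' is a hypothetical stand-in for the real download call.
    public static int downloadAll(List<String> imageUrls, int poolSize,
                                  Predicate<String> downloader) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        int succeeded = 0;
        try {
            List<Future<Boolean>> results = new ArrayList<>();
            for (String url : imageUrls) {
                Callable<Boolean> task = () -> downloader.test(url);
                results.add(pool.submit(task));
            }
            for (Future<Boolean> result : results) {
                try {
                    if (result.get()) {
                        succeeded++;
                    }
                } catch (ExecutionException e) {
                    // The download threw an exception: count it as failed.
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
        } finally {
            pool.shutdownNow();
        }
        return succeeded;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList("img1", "img2", "img3");
        int poolSize = initialPoolSize(Runtime.getRuntime().availableProcessors());
        long start = System.nanoTime();
        int ok = downloadAll(urls, poolSize, url -> true); // fake downloader
        long millis = (System.nanoTime() - start) / 1_000_000;
        System.out.println(ok + " of " + urls.size() + " downloads succeeded with "
                + poolSize + " threads in " + millis + " ms");
    }
}
```

Timing the same batch with a few different poolSize values is one simple way to do the "play around" step; the best value depends on the server and the network, not only on the core count.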

> That would create 5 different pools of 1000 threads each, that'll be like 5000 threads which might degrade the performance instead of improving.

I see no point in multiple pools unless there are different servers for each album, or some other reason why the downloads from each album are different.
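Since the question mentions CompletableFuture, here is a hedged sketch of how every album could share one pool instead of getting its own. The class SharedPoolAlbums, the nested-list representation of albums and the Predicate-based downloader are assumptions for illustration, not the asker's actual types.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Predicate;

public class SharedPoolAlbums {

    // Flattens all albums into one stream of tasks on a single shared pool.
    // Returns one status per image, in album order then image order.
    // 'downloader' is a hypothetical stand-in for the real download call.
    public static List<Boolean> downloadAlbums(List<List<String>> albums, int poolSize,
                                               Predicate<String> downloader) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            List<CompletableFuture<Boolean>> futures = new ArrayList<>();
            for (List<String> album : albums) {
                for (String imageUrl : album) {
                    CompletableFuture<Boolean> future = CompletableFuture
                            .supplyAsync(() -> downloader.test(imageUrl), pool)
                            // A thrown exception becomes a plain "false" status.
                            .exceptionally(ex -> Boolean.FALSE);
                    futures.add(future);
                }
            }
            List<Boolean> statuses = new ArrayList<>();
            for (CompletableFuture<Boolean> future : futures) {
                statuses.add(future.join()); // waits for that download to finish
            }
            return statuses;
        } finally {
            pool.shutdown();
        }
    }
}
```

With this shape the number of albums no longer affects the thread count; only poolSize does.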

The only way to make it "very very fast" is to get a "very very fast" network connection to the server; e.g. co-locate your client with the server that you are downloading from.

Your download speeds are going to be constrained by a number of potential bottlenecks. These include:

  1. The performance of the server; i.e. how fast it can assemble the data to send to you and push it through its network interface.

  2. Per-user request limits imposed by the service.

  3. The end-to-end performance of the network path between your client and the server.

  4. The performance of the machine you are running on in terms of moving data from the network and putting it (I guess) onto your local disk.

The bottleneck could be any of these, or a combination of them.

Throwing thousands of threads at the problem is unlikely to improve things. Indeed, if anything, it is likely to make performance worse. For example:

  • it could congest your network link, or
  • it could trigger anti-hogging or anti-DoS defenses in the server you are fetching from.

A better idea would be to use an ExecutorService with a small bounded worker pool, and submit the downloads to the pool as tasks. (And try to keep HTTP / HTTPS connections open between downloads.)
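One possible way to combine both suggestions is sketched below. It swaps the Commons IO FileUtils call from the question for the JDK's java.net.http.HttpClient (Java 11+); sharing a single client instance lets connections to the same host be reused between requests. The class name KeepAliveDownloader, the timeouts and the file-naming rule are assumptions made for this example.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class KeepAliveDownloader {

    // One client for the whole run, so its connections can be reused.
    private static final HttpClient CLIENT = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(10)) // assumed timeout
            .build();

    // Assumed naming rule: last path segment of the URI, or "index" if empty.
    public static String fileNameFor(URI uri) {
        String path = uri.getPath() == null ? "" : uri.getPath();
        String name = path.substring(path.lastIndexOf('/') + 1);
        return name.isEmpty() ? "index" : name;
    }

    // Downloads one URI into targetDir; returns true only for HTTP 200.
    // Note: the response body is written to the file even on an error status.
    public static boolean download(URI uri, Path targetDir) {
        HttpRequest request = HttpRequest.newBuilder(uri)
                .timeout(Duration.ofSeconds(60)) // assumed timeout
                .GET()
                .build();
        try {
            HttpResponse<Path> response = CLIENT.send(request,
                    HttpResponse.BodyHandlers.ofFile(targetDir.resolve(fileNameFor(uri))));
            return response.statusCode() == 200;
        } catch (IOException e) {
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    // Submits every download to a small bounded pool; returns the success count.
    public static int downloadAll(List<URI> uris, Path targetDir, int poolSize) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        int succeeded = 0;
        try {
            List<Callable<Boolean>> tasks = new ArrayList<>();
            for (URI uri : uris) {
                tasks.add(() -> download(uri, targetDir));
            }
            for (Future<Boolean> result : pool.invokeAll(tasks)) {
                try {
                    if (result.get()) {
                        succeeded++;
                    }
                } catch (ExecutionException e) {
                    // Unexpected failure inside a task: count it as failed.
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            pool.shutdown();
        }
        return succeeded;
    }
}
```

URIs passed to this sketch need an explicit http or https scheme; bare hosts such as the ones in the sample request would have to be prefixed first.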


I would also advise you to make sure that you have permission to do what you are doing. Companies in the music publishing business have good lawyers. They could make your life unpleasant [1] if they perceive you to be violating their terms and conditions or stealing their intellectual property.

[1] Like blocking your IP address or issuing take-down requests to your service provider.
