
Best way to parallelize thousands of downloads

I am creating an application in which I have to download thousands of images (~1 MB each) using Java.

I take a list of album URLs in my REST request; each album contains multiple images.

So my request looks something like:

[
  "www.abc.xyz/album1",
  "www.abc.xyz/album2",
  "www.abc.xyz/album3",
  "www.abc.xyz/album4",
  "www.abc.xyz/album5"
]

Suppose each of these albums has 1000 images, so I need to download 50000 images in parallel.

Right now I have implemented it using parallelStream() but I feel that I can optimize it further.

There are two principal classes, AlbumDownloader and ImageDownloader (both Spring components).

So the main application creates a parallelStream() on the list of albums:

albumData.parallelStream().forEach(ad -> albumDownloader.downloadAlbum(ad));

And there is another parallelStream() inside the AlbumDownloader.downloadAlbum() method:

List<Boolean> downloadStatus = albumData.getImageDownloadData()
        .parallelStream()
        .map(idd -> imageDownloader.downloadImage(idd))
        .collect(Collectors.toList());

I am thinking about using CompletableFuture with an ExecutorService, but I am not sure what pool size I should use.

Should I create a separate pool for each Album?

ExecutorService executor = Executors.newFixedThreadPool(Math.min(albumData.getImageDownloadData().size(), 1000));

That would create 5 different pools of 1000 threads each. That is about 5000 threads, which might degrade performance instead of improving it.

Could you please give me some ideas to make it very very fast?

I am using Apache Commons IO FileUtils to download the files, by the way, and my machine has 12 available CPU cores.

Suppose each of these albums has 1000 images, so I need to download 50000 images in parallel.

It's wrong to think of your application doing 50000 things in parallel. What you are trying to do is to optimize your throughput – you are trying to download all of the images in the shortest amount of time.

You should try one fixed-size thread pool and then adjust the number of threads in it until you maximize your throughput. Maybe start with double the number of processors. If your application spends most of its time waiting for the network or the server, you can probably increase the number of threads, but you don't want to overload the server so that it slows to a crawl, and you don't want to thrash your own application with a huge number of threads.
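As a rough sketch of that idea (not a definitive implementation), all images from all albums can be submitted to one shared fixed-size pool through CompletableFuture. The names SharedPoolDownloader, downloadAll and initialPoolSize are made up for illustration, and the downloadImage function is a stand-in for the question's imageDownloader.downloadImage(idd):

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical helper; class and method names are assumptions, not from the question.
class SharedPoolDownloader {

    /**
     * Runs one task per image on a single fixed-size pool shared by all albums.
     * downloadImage stands in for the question's imageDownloader.downloadImage(idd).
     */
    static <T> List<Boolean> downloadAll(List<List<T>> albums,
                                         Function<T, Boolean> downloadImage,
                                         int poolSize) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            // Flatten the albums and submit every image as its own task.
            // collect() forces all submissions before any join() happens.
            List<CompletableFuture<Boolean>> futures = albums.stream()
                    .flatMap(List::stream)
                    .map(image -> CompletableFuture
                            .supplyAsync(() -> downloadImage.apply(image), pool)
                            // Treat a thrown exception as a failed download.
                            .exceptionally(ex -> false))
                    .collect(Collectors.toList());

            // join() waits for each task; results keep submission order.
            return futures.stream()
                    .map(CompletableFuture::join)
                    .collect(Collectors.toList());
        } finally {
            pool.shutdown();
        }
    }

    /** Starting point only: twice the number of available processors. */
    static int initialPoolSize() {
        return 2 * Runtime.getRuntime().availableProcessors();
    }
}
```

With this shape the pool size is a single parameter, so you can time a full run with initialPoolSize() and then with larger or smaller values and keep whichever gives the best throughput.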

That would create 5 different pools of 1000 threads each. That is about 5000 threads, which might degrade performance instead of improving it.

I see no point in multiple pools unless there are different servers for each album or some other reason why the downloads from each album are different.

The only way to make it "very very fast" is to get a "very very fast" network connection to the server, e.g. by co-locating your client with the server that you are downloading from.

Your download speeds are going to be constrained by a number of potential bottlenecks. These include:

  1. The performance of the server, i.e. how fast it can assemble the data to send to you and push it through its network interface.

  2. Per-user request limits imposed by the service.

  3. The end-to-end performance of the network path between your client and the server.

  4. The performance of the machine you are running on in terms of moving data from the network and putting it (I guess) onto your local disk.

The bottleneck could be any of these, or a combination of them.

Throwing thousands of threads at the problem is unlikely to improve things. Indeed, if anything it is likely to make performance less than optimal. For example:

  • it could congest your network link, or
  • it could trigger anti-hogging or anti-DOS defenses in the server you are fetching from.

A better idea would be to use an ExecutorService with a small bounded worker pool, and submit the downloads to the pool as tasks. (And try to keep HTTP / HTTPS connections open between downloads.)
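One possible way to keep connections open is to share a single java.net.http.HttpClient (Java 11+) across all download tasks, since one client instance can reuse its connections to the same host. This is only a sketch under assumptions: it swaps HttpClient in for the Commons IO FileUtils call mentioned in the question, and the class name KeepAliveDownloader, the download method and the timeout values are all made up for illustration.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;

// Hypothetical helper; names and timeouts are assumptions, not from the question.
class KeepAliveDownloader {

    // One client for the whole application, shared by every download task.
    private static final HttpClient CLIENT = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(10))
            .followRedirects(HttpClient.Redirect.NORMAL)
            .build();

    /** Fetches source into target; returns true only for an HTTP 200 response. */
    static boolean download(URI source, Path target) {
        HttpRequest request = HttpRequest.newBuilder(source)
                .timeout(Duration.ofSeconds(60))
                .GET()
                .build();
        try {
            // ofFile streams the body straight to disk instead of buffering it in memory.
            HttpResponse<Path> response =
                    CLIENT.send(request, HttpResponse.BodyHandlers.ofFile(target));
            if (response.statusCode() == 200) {
                return true;
            }
            // The body of an error response was written too; remove it.
            Files.deleteIfExists(target);
            return false;
        } catch (IOException e) {
            return false;
        } catch (InterruptedException e) {
            // Preserve the interrupt so the worker pool can shut down cleanly.
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

Each task submitted to the bounded pool would then call KeepAliveDownloader.download(uri, path), so the number of simultaneous requests is capped by the pool size rather than by the number of images.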


I would also advise you to make sure that you have permission to do what you are doing. Companies in the music publishing business have good lawyers. They could make your life unpleasant [1] if they perceive you to be violating their terms and conditions or stealing their intellectual property.

[1] Like blocking your IP address or issuing take-down requests to your service provider.


 