
Bottleneck while uploading lots of files to a GCP bucket in a short time

So I have a GCP bucket that I have to upload files to. The issue is that I have 10 million files to upload (each about 50 KB in size) and a time constraint of 8 hours or fewer. Currently I am using a Java program (Google's reference code); tested on 1000 images, it uploads each file in about 300 milliseconds. With multi-threading I have been able to reduce the average time to 40 milliseconds (using 20 threads), and I can go up to 60 threads and reduce the time further to 15-20 milliseconds, but then I face 3 problems:

  1. 20 milliseconds per file isn't fast enough. I need it to be 3 milliseconds or less (10 million files in 8 hours works out to roughly 2.9 milliseconds per file).

  2. It throws a “com.google.cloud.storage.StorageException: Connect timed out” exception when I exceed 25 threads.

  3. Going beyond 60 threads, the program doesn't seem to get any faster (I am guessing a hardware constraint).
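
For reference, below is a minimal sketch of the kind of multithreaded upload loop I am describing, using the google-cloud-storage Java client and a fixed thread pool; the bucket name, source directory and thread count are placeholders, not my exact code.

    import com.google.cloud.storage.BlobId;
    import com.google.cloud.storage.BlobInfo;
    import com.google.cloud.storage.Storage;
    import com.google.cloud.storage.StorageOptions;

    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;
    import java.util.stream.Stream;

    public class ParallelUploader {
        public static void main(String[] args) throws Exception {
            String bucketName = "my-bucket";                 // placeholder
            Path sourceDir = Paths.get("/path/to/files");    // placeholder

            // One shared client; the Storage client is thread-safe.
            Storage storage = StorageOptions.getDefaultInstance().getService();
            ExecutorService pool = Executors.newFixedThreadPool(20);

            try (Stream<Path> files = Files.list(sourceDir)) {
                files.filter(Files::isRegularFile).forEach(path -> pool.submit(() -> {
                    try {
                        BlobInfo blobInfo = BlobInfo.newBuilder(
                                BlobId.of(bucketName, path.getFileName().toString())).build();
                        // One HTTP request per ~50 KB file, so request latency dominates.
                        storage.create(blobInfo, Files.readAllBytes(path));
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }));
            }

            pool.shutdown();
            pool.awaitTermination(8, TimeUnit.HOURS);
        }
    }

Every object is still one HTTP request, so the per-file latency stays high no matter how many threads I add.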

Additional Info:

My internet speed is 700 Mbps to 1.3 Gbps. I have thought about zipping and uploading, but we have some constraints there too, so I can't use that approach.

Thanks in advance.

You might have a hotspot on Cloud Storage. You can check out this video, which explains why this happens and how to solve the issue, i.e. add a hash in your file name, before the sequential part.

So I figured it out. The answer by guillaume blaquiere made sense, but it didn't solve my issue. My issue was the large number of small files. To improve the performance I did the following:

  1. I made zips of 1000 files each (the logic behind 1000 is explained at the end), which made each zip approximately 50-60 MB. That reduced my data set from 10 million files to 10,000 zips.

  2. I used gsutil to upload the zips into the bucket, and a Cloud Function with a trigger tied to the bucket to unzip them (a rough sketch of such a function follows this list). Since Cloud Functions can scale out to multiple instances, it handled the parallel uploads easily. Each invocation took about 40-50 seconds, which included unzipping and some other operations; the unzipping alone, I assume, takes somewhere between 20-30 seconds.

  3. Uncommented and changed the following parameters in the .boto file (/Users/UserName/.boto):

    parallel_composite_upload_threshold = 120

    parallel_composite_upload_component_size = 50M
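
For step 2, here is a rough sketch of the kind of bucket-triggered Cloud Function that could do the unzipping, assuming the Java Functions Framework (BackgroundFunction) and a Gson-deserialized event payload; the GcsEvent class, destination bucket name and buffer size are illustrative, not the exact function I deployed.

    import com.google.cloud.functions.BackgroundFunction;
    import com.google.cloud.functions.Context;
    import com.google.cloud.storage.BlobId;
    import com.google.cloud.storage.BlobInfo;
    import com.google.cloud.storage.Storage;
    import com.google.cloud.storage.StorageOptions;

    import java.io.ByteArrayOutputStream;
    import java.io.InputStream;
    import java.nio.channels.Channels;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipInputStream;

    public class UnzipFunction implements BackgroundFunction<UnzipFunction.GcsEvent> {

        // Minimal event payload; the framework deserializes the GCS event JSON into it.
        public static class GcsEvent {
            public String bucket;
            public String name;
        }

        private static final Storage storage = StorageOptions.getDefaultInstance().getService();
        private static final String DEST_BUCKET = "my-unzipped-bucket"; // placeholder

        @Override
        public void accept(GcsEvent event, Context context) throws Exception {
            if (!event.name.endsWith(".zip")) {
                return; // react only to the uploaded archives
            }
            // Stream the archive straight out of the bucket and write each entry back as its own object.
            try (InputStream in = Channels.newInputStream(
                     storage.reader(BlobId.of(event.bucket, event.name)));
                 ZipInputStream zip = new ZipInputStream(in)) {
                ZipEntry entry;
                byte[] buffer = new byte[8192];
                while ((entry = zip.getNextEntry()) != null) {
                    if (entry.isDirectory()) {
                        continue;
                    }
                    ByteArrayOutputStream out = new ByteArrayOutputStream();
                    int read;
                    while ((read = zip.read(buffer)) != -1) {
                        out.write(buffer, 0, read);
                    }
                    BlobInfo target = BlobInfo.newBuilder(
                            BlobId.of(DEST_BUCKET, entry.getName())).build();
                    storage.create(target, out.toByteArray());
                }
            }
        }
    }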

Why I used 1000 files for zipping:

parallel_composite_upload_component_size = 50M means that any uploaded file smaller than 50 MB will not be broken down into chunks, and a zip of 1000 files matched that threshold. I tested zips of 1000, 2000 and 5000 files, and all of them took roughly the same amount of time (in each case 100% of my bandwidth was being used; the difference might only become visible with higher bandwidth). As for the 50M parameter, testing showed it was optimal for our use case.
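
For step 1, the batching itself only needs java.util.zip; the sketch below groups 1000 files per archive, with placeholder paths. At roughly 50 KB per file, each archive lands near the 50-60 MB mentioned above.

    import java.io.IOException;
    import java.io.OutputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.List;
    import java.util.stream.Collectors;
    import java.util.stream.Stream;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipOutputStream;

    public class ZipBatcher {
        public static void main(String[] args) throws IOException {
            Path sourceDir = Paths.get("/path/to/files"); // placeholder: the 50 KB files
            Path outputDir = Paths.get("/path/to/zips");  // placeholder
            int batchSize = 1000;                         // ~50-60 MB per archive

            List<Path> files;
            try (Stream<Path> stream = Files.list(sourceDir)) {
                files = stream.filter(Files::isRegularFile).collect(Collectors.toList());
            }

            for (int i = 0; i < files.size(); i += batchSize) {
                Path zipPath = outputDir.resolve("batch-" + (i / batchSize) + ".zip");
                try (OutputStream fileOut = Files.newOutputStream(zipPath);
                     ZipOutputStream zip = new ZipOutputStream(fileOut)) {
                    for (Path file : files.subList(i, Math.min(i + batchSize, files.size()))) {
                        zip.putNextEntry(new ZipEntry(file.getFileName().toString()));
                        Files.copy(file, zip); // stream the file into the current zip entry
                        zip.closeEntry();
                    }
                }
            }
        }
    }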

Conclusion:

On testing this solution with 10,000 zips (10 million files), I found that it consumed my entire bandwidth, i.e. 200 Mbps, and took about 7 hours to upload.
