
Is there a simple, efficient way to download (open) 5000+ images from Google Cloud Storage into a Python notebook?

I have a Jupyter notebook (Python) running on Google AI Platform. In order to read a file into the notebook from Google Cloud Storage I'm using:

from google.cloud import storage  # `bucket` below is an existing google.cloud.storage Bucket

blob = storage.blob.Blob(filename, bucket)
blob.download_to_filename(filename)  # writes the object to a local file; returns None

Is there a simple way to point to a bucket directory and make reading 5K+ images easier, more efficient, and transparent to the pipeline? Thanks, N

The easiest way is to use the gsutil command with parallelism (-m):

!gsutil -m cp gs://<your bucket>/* /<your local path>/

Add -r if the images are also in subdirectories. There is a video that covers this.
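
If you prefer to stay inside Python rather than shelling out to gsutil, a similar effect can be achieved by listing the blobs under a prefix and downloading them with a thread pool. This is only a minimal sketch using the google-cloud-storage client; the bucket name, prefix, local directory, and worker count are placeholders, not values from the question.

# Minimal sketch: parallel download of all objects under a prefix.
# BUCKET_NAME, PREFIX and LOCAL_DIR are placeholders you must replace.
import os
from concurrent.futures import ThreadPoolExecutor
from google.cloud import storage

BUCKET_NAME = "your-bucket"
PREFIX = "images/"        # the "directory" inside the bucket
LOCAL_DIR = "images"

client = storage.Client()
os.makedirs(LOCAL_DIR, exist_ok=True)

def download(blob):
    # Keep only the file name so downloads land flat in LOCAL_DIR.
    local_path = os.path.join(LOCAL_DIR, os.path.basename(blob.name))
    blob.download_to_filename(local_path)
    return local_path

blobs = [b for b in client.list_blobs(BUCKET_NAME, prefix=PREFIX) if not b.name.endswith("/")]
with ThreadPoolExecutor(max_workers=16) as pool:
    paths = list(pool.map(download, blobs))
print(f"Downloaded {len(paths)} files to {LOCAL_DIR}/")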

If the download is still slow, look at the number of vCPUs your notebook has. The bandwidth is limited to 2 Gbps per vCPU, up to 8 vCPUs.

To increase performance further, watch out for hotspots: if your image names are too similar, the same shard serves all of them and you get contention. There is a video that describes this.
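
As a hypothetical illustration of the naming point (not code from the answer): purely sequential object names like images/img_000001.jpg, images/img_000002.jpg, ... fall into the same key range, while prepending a short hash spreads the keys across shards.

# Hypothetical sketch: spread object keys by prepending a few hash characters.
import hashlib

def spread_name(name: str, prefix_len: int = 4) -> str:
    # A short hex prefix derived from the name distributes keys across shards.
    digest = hashlib.md5(name.encode()).hexdigest()[:prefix_len]
    return f"{digest}/{name}"

print(spread_name("images/img_000001.jpg"))  # -> "<4 hex chars>/images/img_000001.jpg"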

However, you generally don't need all the images in your Jupyter notebook. Validate your model on a small subset of the data first, and run the real training on a dedicated server.
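
To work on just such a sample from the notebook, you can cap the listing. Again a hypothetical sketch with placeholder bucket and prefix names:

# Hypothetical sketch: fetch only the first 200 objects under the prefix for quick validation.
from google.cloud import storage

client = storage.Client()
sample = list(client.list_blobs("your-bucket", prefix="images/", max_results=200))
for blob in sample:
    blob.download_to_filename(blob.name.split("/")[-1])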
