
How to use large volumes of data in Kubeflow?

I have 1 TB of images stored in GCS (the data is split into 3 classes). I want to train a custom TensorFlow model on this data in Kubeflow. Currently, I have pipeline components for training and persisting the model, but I don't know how to correctly feed this data into the classifier.

It seems to me that downloading this data from GCS (with gsutil cp or something similar) every time I run the pipeline (which may also fail) is not the proper way to do this.

How can I use large volumes of data in Kubeflow pipelines without downloading it on every run? How do I express access to this data using the Kubeflow DSL?

If your data is in GCS, TensorFlow can read data from (and write to) GCS directly. The tf.data API lets you set up a performant input pipeline that streams the images during training instead of copying them first.
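For example, here is a minimal tf.data sketch that streams JPEGs straight from a gs:// path; the bucket name, directory layout (one directory per class), and image size are placeholders, not details from the original question:

```python
import tensorflow as tf

# tf.data reads gs:// paths directly, so the 1 TB of images never has to be
# copied into the pipeline pod. The pattern below is a hypothetical layout:
# gs://<bucket>/images/<class_name>/<file>.jpg
GCS_PATTERN = "gs://my-bucket/images/*/*.jpg"  # placeholder path

def parse_image(path):
    # Use the parent directory name as the (string) class label;
    # map it to an integer index before feeding it to a real model.
    label = tf.strings.split(path, "/")[-2]
    image = tf.io.decode_jpeg(tf.io.read_file(path), channels=3)
    image = tf.image.resize(image, [224, 224]) / 255.0
    return image, label

dataset = (
    tf.data.Dataset.list_files(GCS_PATTERN, shuffle=True)
    .map(parse_image, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(32)
    .prefetch(tf.data.AUTOTUNE)
)
```

The training component can then iterate over this dataset directly, and only the batches currently needed are pulled from GCS.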

Can you mount the volume on the host machine?

If yes, mount the volume on the host and then expose that directory to the containers as a hostPath volume. The images are then already present on the node, so whenever a new container starts it can simply mount the volume and begin processing, avoiding a data transfer on each container startup. A sketch of how this can be expressed in the Kubeflow DSL follows below.
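A minimal sketch, assuming the Kubeflow Pipelines v1 SDK and that the images have already been synced to a node-local directory; the paths, container image, and pipeline name are placeholders:

```python
import kfp.dsl as dsl
from kubernetes import client as k8s

@dsl.pipeline(name="train-on-mounted-images")
def train_pipeline():
    train_op = dsl.ContainerOp(
        name="train",
        image="gcr.io/my-project/trainer:latest",  # hypothetical trainer image
        arguments=["--data-dir", "/data"],
    )
    # Declare a hostPath volume pointing at the node-local image directory
    # (assumed to be /mnt/images) and mount it into the container at /data.
    train_op.add_volume(
        k8s.V1Volume(
            name="image-data",
            host_path=k8s.V1HostPathVolumeSource(path="/mnt/images"),
        )
    )
    train_op.container.add_volume_mount(
        k8s.V1VolumeMount(name="image-data", mount_path="/data")
    )
```

Note that hostPath ties the workload to nodes that actually have the data; a PersistentVolumeClaim backed by a shared filesystem is the more portable variant of the same idea.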
