
How to use large volumes of data in Kubeflow?

I have 1TB of images stored in GCS (the data is split into 3 classes). I want to train a custom TensorFlow model on this data in Kubeflow. Currently, I have pipeline components for training and persisting the model, but I don't know how to correctly feed this data into the classifier.

It seems to me that downloading this data from GCS (with gsutil cp or something similar) every time I run the pipeline (which may also fail) is not the right way to do this.

How can I use large volumes of data in Kubeflow pipelines without downloading them every time? How do I express access to this data using the Kubeflow DSL?

Additionally, if your data is in GCS, TensorFlow can read from (and write to) GCS directly. The tf.data API lets you set up a performant data input pipeline on top of that.
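As a rough sketch of what that can look like (assuming a recent TF 2.x release and a hypothetical bucket layout of gs://my-bucket/images/<class_name>/*.jpg; the bucket name and class names below are placeholders), you can point tf.data directly at the gs:// paths without copying anything locally:

```python
import tensorflow as tf

# Hypothetical layout: gs://my-bucket/images/<class_name>/*.jpg
CLASS_NAMES = tf.constant(['class_a', 'class_b', 'class_c'])  # placeholder class names

def parse_image(path):
    # tf.io reads gs:// paths natively, so no prior download is needed.
    image = tf.io.decode_jpeg(tf.io.read_file(path), channels=3)
    image = tf.image.resize(image, [224, 224]) / 255.0
    # Derive the integer label from the parent directory name.
    label = tf.argmax(tf.cast(tf.strings.split(path, '/')[-2] == CLASS_NAMES, tf.int32))
    return image, label

dataset = (
    tf.data.Dataset.list_files('gs://my-bucket/images/*/*.jpg', shuffle=True)
    .map(parse_image, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(32)
    .prefetch(tf.data.AUTOTUNE)
)

# The resulting dataset can be fed straight into Keras, e.g. model.fit(dataset, epochs=...).
```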

Can you mount the volume on the host machine?

If yes, mount the volume on the host and then expose that directory to the containers as a hostPath volume. The images are then already present on the node, and whenever a new container starts it can mount the volume and begin processing, avoiding a data transfer on every container startup. A sketch of how this could look in the pipeline DSL follows below.
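A minimal sketch using the Kubeflow Pipelines v1 DSL, assuming the data has already been placed at /data/images on the node; the image name, paths, and volume name are placeholders:

```python
import kfp.dsl as dsl
from kubernetes import client as k8s_client

@dsl.pipeline(name='train-on-hostpath',
              description='Train using images mounted from the node')
def train_pipeline():
    train_op = dsl.ContainerOp(
        name='train',
        image='gcr.io/my-project/trainer:latest',   # placeholder trainer image
        arguments=['--data-dir', '/mnt/images'],
    )
    # Expose the node directory inside the container as a hostPath volume,
    # so the training step reads the images in place instead of copying them.
    train_op.add_volume(
        k8s_client.V1Volume(
            name='image-data',
            host_path=k8s_client.V1HostPathVolumeSource(path='/data/images'),
        )
    )
    train_op.container.add_volume_mount(
        k8s_client.V1VolumeMount(name='image-data', mount_path='/mnt/images')
    )
```

Note that hostPath ties the step to whichever node actually holds the data; a PersistentVolumeClaim is the more portable alternative if the cluster has shared storage available.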
