What is the correct way to share dataframes between components?
I am working on a legacy Kubeflow project where the pipelines have a few components that apply filters to a data frame.
To do this, each component downloads the data frame from S3, applies its filter, and uploads the result to S3 again.
The components that use the data frame for training or validating the models also download it from S3.
My question is whether this is a best practice, or whether it is better to share the data frame directly between components, since the upload to S3 can fail and thereby fail the pipeline.
Thanks
As always with questions asking for the "best" or "recommended" method, the primary answer is: "it depends".
However, there are certain considerations worth spelling out in your case.
Saving to S3 between pipeline steps. This stores intermediate results of the pipeline, and as long as the steps take a long time and are restartable, it may be worth doing. What "long time" means depends on your use case, though.
Passing the data directly from component to component. This saves you storage throughput and, very likely, the not-insignificant time it takes to store and retrieve the data to/from S3. The downside: if the pipeline fails midway, you have to start from scratch.
So the questions are:
The question is about if this is a best practice
The best practice is to use the file-based I/O and built-in data-passing features. The current implementation uploads the output data to storage in the upstream components and downloads it in the downstream components. This is the safest and most portable option and should be used until you see that it no longer works for you (100 GB datasets will probably not work reliably).
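The file-based data-passing pattern can be sketched without any Kubeflow dependency: each step reads a dataframe from a path, transforms it, and writes its result to a new path that the next step consumes. The `value` column and the filter condition below are hypothetical, chosen only to illustrate the shape of the pattern.

```python
# A minimal sketch (plain pandas, no KFP dependency) of file-based
# data passing: steps communicate only through files, never through
# in-memory objects.
import os
import tempfile

import pandas as pd


def filter_step(input_path: str, output_path: str) -> None:
    # Hypothetical filter: keep rows where "value" is positive.
    df = pd.read_csv(input_path)
    df[df["value"] > 0].to_csv(output_path, index=False)


def train_step(input_path: str) -> int:
    # A downstream consumer just reads the file its upstream produced.
    df = pd.read_csv(input_path)
    return len(df)


tmp = tempfile.mkdtemp()
raw = os.path.join(tmp, "raw.csv")
filtered = os.path.join(tmp, "filtered.csv")

pd.DataFrame({"value": [-1, 2, 3]}).to_csv(raw, index=False)
filter_step(raw, filtered)
print(train_step(filtered))  # → 2
```

In a real pipeline the framework manages those paths for you (and backs them with object storage such as S3), so each component's code only ever sees local file paths.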
or is better to share the data frame directly between components
How would you "directly share" an in-memory Python object between different Python programs running in containers on different machines?
because the upload to the S3 can fail, and then fail the pipeline.
The failed pipeline can just be restarted. The caching feature will ensure that already-finished tasks are not re-executed.
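The idea behind that caching can be sketched in a few lines. This is only an illustration of the principle, not Kubeflow's actual caching mechanism: a step is skipped when its output artifact already exists, so restarting a pipeline only re-runs the steps that never finished.

```python
# Minimal illustration of step-level caching: skip a step whose output
# artifact already exists, so a restart re-runs only unfinished work.
import os
import tempfile

executions = []  # records which steps actually ran


def run_step(name: str, output_path: str) -> None:
    if os.path.exists(output_path):
        return  # cache hit: output was produced by a previous run
    with open(output_path, "w") as f:
        f.write(f"result of {name}\n")
    executions.append(name)


out = os.path.join(tempfile.mkdtemp(), "step_a.txt")
run_step("step_a", out)  # first run: executes and writes the artifact
run_step("step_a", out)  # simulated restart: cache hit, skipped
print(executions)  # → ['step_a']
```

Real pipeline engines key the cache on the step's inputs and definition as well, not just on the existence of the output, but the consequence is the same: a failed upload costs you a retry, not the whole pipeline.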
Anyway, what is the alternative? How can you send data between distributed containerized programs without sending it over the network?