
Can Google Data Fusion do the same data cleaning as Dataprep?

I want to run a machine learning model on some data. Before training the model I need to process the data, so I have been reading about ways to do that.

  1. First, create a Dataflow pipeline to upload the data to BigQuery or Google Cloud Storage, then create a data pipeline with Google Dataprep to clean it.

  2. The other way I read about is Data Fusion, which makes it easier to create data pipelines. Here is my doubt: is Data Fusion only for creating a pipeline, like Dataflow, so that I still have to use Dataprep to clean the data? Or can Data Fusion itself clean the data and prepare it for my machine learning model?

If Data Fusion can clean the data the way Dataprep does, when should I use Dataprep?

Data Fusion and Dataprep can do the same things. However, they execute differently:

  • Data Fusion creates a Spark pipeline and runs it on a Dataproc cluster
  • Dataprep creates a Beam pipeline and runs it on Dataflow

IMO, Data Fusion is designed more for data ingestion from one source to another, with few transformations. Dataprep is designed more for data preparation (as its name suggests): data cleaning, new column creation, column splitting. Dataprep also gives you insight into the data to help you build your recipes.
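To make those preparation steps concrete (cleaning, splitting a column, creating a new column), here is a minimal plain-Python sketch. It is not what Dataprep or Data Fusion generates, and it uses no Google Cloud API. The field names (`full_name`, `age`) and the helper `prepare_rows` are hypothetical, chosen only for illustration.

```python
from typing import Dict, Iterable, List, Optional


def prepare_rows(rows: Iterable[Dict[str, str]]) -> List[Dict[str, object]]:
    """Clean raw rows, split one column, and derive a new one.

    Hypothetical helper: field names are assumptions for illustration.
    """
    prepared = []
    for row in rows:
        # Cleaning: trim whitespace and skip rows with no usable name.
        full_name = (row.get("full_name") or "").strip()
        if not full_name:
            continue

        # Column splitting: "First Last" -> first_name, last_name.
        first_name, _, last_name = full_name.partition(" ")

        # Cleaning: parse age, keeping None when the value is not a number.
        raw_age = (row.get("age") or "").strip()
        age: Optional[int] = int(raw_age) if raw_age.isdecimal() else None

        # New column creation: a derived flag the model could use.
        is_adult = age is not None and age >= 18

        prepared.append({
            "first_name": first_name,
            "last_name": last_name.strip(),
            "age": age,
            "is_adult": is_adult,
        })
    return prepared


if __name__ == "__main__":
    raw = [
        {"full_name": "  Ada Lovelace ", "age": "36"},
        {"full_name": "", "age": "20"},
        {"full_name": "Alan Turing", "age": "n/a"},
    ]
    for prepared_row in prepare_rows(raw):
        print(prepared_row)
```

In a managed tool, each of these steps would correspond to a recipe step or a pipeline transform rather than hand-written code; the sketch only shows the kind of logic involved.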

In addition, TensorFlow Extended (TFX) uses Beam for its data processing, so your data engineering pipeline will be more consistent if you use a tool that is compatible with Beam.

That's why I would recommend Dataprep over Data Fusion.

