
Pentaho and Hadoop

I am sorry if this question seems naive, but I am new to the data engineering field and currently self-taught. My question is: what is the difference between ETL products like Pentaho and Hadoop? When should I use one instead of the other? Or can I use them together, and if so, how?

Thank you,

An ETL is a tool that Extracts data, Transforms it (join, enrich, filter, ...) and Loads the result into another data store. Good ETL tools are visual, data-store agnostic and easy to automate.
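To make the three steps concrete, here is a minimal sketch in plain Python, assuming a small inline CSV as the source and an in-memory SQLite database as the target. The `payments` table, the column names and the sample rows are made up for illustration; a real ETL tool would give you the same steps as configurable, visual building blocks.

```python
import csv
import io
import sqlite3

# Hypothetical source data; in practice this would be a file, an API or a database.
RAW_CSV = """id,name,amount
1,alice,10.5
2,bob,-3.0
3,carol,7.25
"""

def extract(text):
    """Extract: parse the CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: drop negative amounts, convert types, capitalize names."""
    return [
        (int(r["id"]), r["name"].title(), float(r["amount"]))
        for r in rows
        if float(r["amount"]) >= 0
    ]

def load(records, conn):
    """Load: write the transformed records into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS payments (id INTEGER, name TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO payments VALUES (?, ?, ?)", records)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
loaded = conn.execute("SELECT id, name, amount FROM payments ORDER BY id").fetchall()
print(loaded)
```

The row with the negative amount is filtered out during the transform step, so only two rows reach the target table.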

Hadoop is a data store distributed across a cluster of networked machines, plus software to handle the disseminated data. Data transformation is specialized to a few elementary operations that can be optimized for this usually massive amount of data, such as (but not only) Map-Reduce.
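As a rough illustration of the Map-Reduce idea, here is a single-process word count in plain Python. It is only a sketch of the programming model and does not use the Hadoop API; on a real cluster the map and reduce calls would run in parallel on the machines that hold the data. The function names are my own.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)

def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())

counts = word_count([
    "Hadoop stores data",
    "Pentaho moves data",
    "hadoop processes data",
])
print(counts)
```

Because each map call only looks at one line and each reduce call only looks at one key, the work can be split across many machines, which is what makes this kind of elementary operation a good fit for very large data sets.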

Pentaho Data Integration (PDI) has connectors to Hadoop systems which are easy to set up and tune. So the best strategy is to set up a Hadoop cluster as the data store and manipulate it through PDI.

Pentaho PDI is a tool for creating, managing, running and monitoring ETL workflows. It can work with Hadoop, RDBMSs, queues, files, etc. Hadoop is a platform for distributed computation (Map-Reduce framework, HDFS, etc.). Many tools can run on Hadoop, or connect to Hadoop to use its data and run processes.

Pentaho PDI can connect to Hadoop using its own connectors and read/write data. You can start a Hadoop job from PDI; it can also process data by itself inside a transformation flow and store or send the results to HDFS, an RDBMS, a queue, email, etc. Of course, you can invent your own tool for ETL workflows or simply use bash + Hive, etc., but PDI allows ETL processing in a unified way that does not depend on data sources and targets. Pentaho also has great visualization.

