
Can Spark Replace an ETL Tool?

Existing process: raw structured data is copied into a staging layer in Redshift. An ETL tool such as Informatica or Talend then does incremental loading into the fact and dimension tables of the data mart/data warehouse. All joins happen within the database layer (the ETL tool pushes queries down into the DB).
- Can Spark replace the ETL tool, perform the same processing, and load the data into Redshift?
- What are the advantages and disadvantages of that architecture?

I have worked extensively on projects migrating existing ETL jobs to Spark for the last 4 years.

The problems with the ETL jobs were as follows:

  1. They didn't give us a strict SLA. The jobs shared the same resource pool, so prioritizing was hard. Everyone marked their jobs as business critical.

  2. The cost of the ETL-based jobs was high, since we were paying the vendor.

  3. Scale was another important issue. We required ETL at a gigantic scale, which we found too expensive.

Thus, we migrated all the ETLs to Spark jobs. Spark and Hadoop are both open source, so we had no additional cost beyond the compute itself.

Spark's support for SQL has improved dramatically over time. You can run ML/graph queries and normal ETL against the same DataFrame. Spark joins are fast and can be optimized for different datasets, and you get much more fine-grained control over your transformations and joins.
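To make the join-optimization point concrete, here is a minimal sketch of what a broadcast (map-side) join does, using plain Python dictionaries in place of Spark DataFrames. The table contents and column names are hypothetical; in Spark you would get the same effect by marking the small dimension table with `broadcast()` before joining.

```python
# Sketch of a broadcast join: the small dimension table is built into an
# in-memory hash map (in Spark, shipped to every executor), so the large
# fact table can be joined in a single streaming pass with no shuffle.

def broadcast_join(fact_rows, dim_rows, key):
    # Build a lookup index over the small table once (the "broadcast" side).
    dim_index = {row[key]: row for row in dim_rows}
    # Stream the large table and enrich each matching row by lookup.
    for fact in fact_rows:
        dim = dim_index.get(fact[key])
        if dim is not None:
            yield {**fact, **dim}

# Hypothetical sample data.
facts = [
    {"cust_id": 1, "amount": 100},
    {"cust_id": 2, "amount": 250},
    {"cust_id": 9, "amount": 75},   # no matching dimension row: dropped (inner join)
]
dims = [
    {"cust_id": 1, "region": "EU"},
    {"cust_id": 2, "region": "US"},
]

joined = list(broadcast_join(facts, dims, "cust_id"))
```

The fine-grained control mentioned above is exactly this kind of choice: in a push-down ETL tool the database picks the join strategy, whereas in Spark you can decide which side is broadcast, when to repartition, and when to cache.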

We started with a long-running cluster supporting Spark and other big data tools. We unified the platform so that all customers could use it, and slowly migrated all the ETL jobs to Spark jobs.

We still use Redshift for reporting, but all the heavy lifting (finding insights in the data, joins, managing incoming data, and merging it with the existing snapshot) is done in Spark.
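The snapshot merge described above is essentially an upsert: incoming rows replace snapshot rows with the same key, and genuinely new rows are appended. A minimal plain-Python sketch, with hypothetical key and field names (in Spark this would typically be an outer join on the key columns, or a `MERGE INTO` with a table format such as Delta):

```python
def merge_snapshot(snapshot, incoming, key):
    """Upsert incoming rows into the existing snapshot:
    rows with a matching key are replaced, new keys are appended."""
    merged = {row[key]: row for row in snapshot}
    for row in incoming:
        merged[row[key]] = row  # update an existing key or insert a new one
    return sorted(merged.values(), key=lambda r: r[key])

# Hypothetical snapshot and incoming batch.
snapshot = [
    {"id": 1, "status": "open"},
    {"id": 2, "status": "open"},
]
incoming = [
    {"id": 2, "status": "closed"},  # updates id 2
    {"id": 3, "status": "open"},    # inserts id 3
]

result = merge_snapshot(snapshot, incoming, "id")
```

Doing this merge in Spark rather than in Redshift keeps the warehouse free for reporting queries, which is the division of labor the answer describes.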

We saved millions of dollars by moving away from the existing ETL jobs and migrating them to Spark.

My two pennies on this: eventually Spark, Hive, and the Hadoop big data stack will outrun the ETL tools. I am not saying ETL tools will be eviscerated, but the open source solutions will definitely become the dominant force in this domain.

May I know the reason for replacing Informatica with Spark? The Informatica BDM 10.1 edition comes with a Spark execution engine, which converts Informatica mappings into their Spark equivalent (Scala code) and executes that on the cluster. Also, in my opinion, Spark is more suitable for data that is not reshaped at intermediate steps, whereas in ETL the data changes from transformation to transformation!

