
Is Hive on Tez with ORC really faster than Spark SQL for ETL?

I have little experience in Hive and am currently learning Spark with Scala. I am curious to know whether Hive on Tez is really faster than Spark SQL. I searched many forums for test results, but they compare older versions of Spark and most were written in 2015. The main points are summarized below:

  • ORC performs about the same as Parquet does in Spark
  • The Tez engine gives performance comparable to the Spark engine
  • Joins are better/faster in Hive than in Spark

I feel like Hortonworks supports Hive more than Spark, and Cloudera the other way around.

Sample links:

link1

link2

link3

Initially I thought Spark would be faster than anything because of its in-memory execution, but after reading some articles I realized that the existing Hive stack is also improving with new concepts like Tez, ORC, and LLAP.

We are currently running PL/SQL on Oracle and migrating to big data since volumes are increasing. My requirements are ETL-style batch processing; the data details involved in each weekly batch run are listed below. Data volume will grow significantly soon.

  • Input/lookup data are in CSV/text format and are loaded into tables

  • Two input tables, each with 5 million rows and 30 columns

  • 30 lookup tables are used to generate each column of the output table, which contains around 10 million rows and 220 columns.
  • Multiple joins are involved, both inner and left outer, since many lookup tables are used (see the join sketch after this list).
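To make that join pattern concrete, here is a minimal Spark-with-Scala sketch: an inner join between the two input tables, followed by left outer joins against a couple of lookup tables. All table names, column names, and join keys (customer_id, product_code, region_code, etc.) are hypothetical placeholders, not the actual schema.

```scala
import org.apache.spark.sql.SparkSession

object LookupJoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("lookup-join-sketch")
      .enableHiveSupport()
      .getOrCreate()

    // Two hypothetical input tables and two of the many lookup tables.
    val input1  = spark.table("staging.input_table_1")
    val input2  = spark.table("staging.input_table_2")
    val lookupA = spark.table("ref.lookup_a")
    val lookupB = spark.table("ref.lookup_b")

    // Inner join between the inputs, then left outer joins against the lookups;
    // the same pattern repeats for the remaining lookup tables.
    val output = input1
      .join(input2, Seq("customer_id"))                     // inner join
      .join(lookupA, Seq("product_code"), "left_outer")     // enrich from lookup A
      .join(lookupB, Seq("region_code"),  "left_outer")     // enrich from lookup B

    output.write.mode("overwrite").saveAsTable("mart.output_table")
  }
}
```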

Kindly advise which of the two methods below I should choose for better performance, readability, and ease of making minor column-level updates in future production deployments.

Method 1:

  • Hive on Tez with ORC tables
  • Python UDFs via the TRANSFORM option
  • Joins with performance tuning such as map joins

Method 2:

  • Spark SQL with Parquet format converted from text/CSV
  • Scala for UDFs
  • Hopefully we can perform multiple inner and left outer joins in Spark (a rough sketch follows this list)
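A minimal sketch of Method 2, assuming hypothetical file paths, column names, and UDF logic: convert the CSV landing files to Parquet once, register a Scala UDF, and query through Spark SQL. The joins would then follow the same pattern as the earlier sketch.

```scala
import org.apache.spark.sql.SparkSession

object CsvToParquetSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("csv-to-parquet-sketch")
      .enableHiveSupport()
      .getOrCreate()

    // Convert the text/CSV input to Parquet once, then query the Parquet copy.
    spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/data/landing/input_table_1.csv")      // hypothetical path
      .write.mode("overwrite")
      .parquet("/data/parquet/input_table_1")

    // A Scala UDF registered for use inside Spark SQL (example logic only).
    spark.udf.register("normalize_code",
      (code: String) => Option(code).map(_.trim.toUpperCase).orNull)

    spark.read.parquet("/data/parquet/input_table_1").createOrReplaceTempView("input_1")
    spark.sql("SELECT normalize_code(product_code) AS product_code, amount FROM input_1").show(5)
  }
}
```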

The best way to implement a solution to your problem is outlined below.

For loading the data into tables, Spark looks like a good option to me. You can read the tables from the Hive metastore, perform the incremental updates using window functions, and register the results back in Hive. Since the data is populated from various lookup tables during ingestion, you can write the code programmatically in Scala.
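A rough sketch of that flow, under assumed table names and keys (mart.output_table, staging.weekly_delta, customer_id, and load_date are placeholders): union the existing Hive table with the new weekly batch, keep the latest record per key with a window function, and save the result back through the Hive metastore.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

object IncrementalUpdateSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("incremental-update-sketch")
      .enableHiveSupport()
      .getOrCreate()

    // Existing Hive table plus the new weekly batch (hypothetical names).
    val current  = spark.table("mart.output_table")
    val incoming = spark.table("staging.weekly_delta")

    // Keep only the latest record per business key using a window function.
    val latestPerKey = Window.partitionBy("customer_id").orderBy(col("load_date").desc)
    val merged = current.unionByName(incoming)
      .withColumn("rn", row_number().over(latestPerKey))
      .filter(col("rn") === 1)
      .drop("rn")

    // Write to a new table name; overwriting a table that is also being read
    // in the same job is not allowed.
    merged.write.mode("overwrite").saveAsTable("mart.output_table_merged")
  }
}
```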

But at the end of the day, there needs to be a query engine that is very easy to use. Since your Spark program registers the tables with Hive, you can use Hive for querying.

Hive supports three execution engines:

  • Spark
  • Tez
  • Mapreduce

Tez is mature; Spark is still evolving, with various commits from Facebook and the community.

Business users can understand Hive very easily as a query engine, since it is much more mature in the industry.

In short, use Spark to process the data in the daily runs and register the results with Hive.

Create business users in hive.
