
Spark Performance Issue vs Hive

Original post 2019-07-05 10:13:56 5 1 apache-spark/ hadoop/ hive/ hdfs

I am working on a pipeline that will run daily. It joins two tables, x and y (approx. 18 MB and 1.5 GB respectively), and loads the output of the join into a final table.

Here are the facts about the environment:

For table x:

Data size: 18 MB
Number of files in a partition : ~191
file type: parquet

For table y:

Data size: 1.5 GB
Number of files in a partition : ~3200
file type: parquet

Now the problem is:

Hive and Spark give the same performance (the time taken is the same).

I tried different combinations of resources for the Spark job.

e.g.:

executors:50 memory:20GB cores:5
executors:70 memory:20GB cores:5
executors:1 memory:20GB cores:5

All three combinations give the same performance. I am not sure what I am missing here.

I also tried broadcasting the small table 'x' to avoid a shuffle during the join, but it did not improve performance much.

One key observation is:

About 70% of the execution time is spent reading the big table 'y', and I guess this is due to the large number of files per partition.

I am not sure how Hive manages to give the same performance.

Kindly suggest.

1 answer

I think the main issue is that there are too many small files. A lot of CPU and time is consumed by the I/O itself, so you never get to see the processing power of Spark.
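To put a number on "small", here is a quick back-of-the-envelope calculation using only the sizes and file counts quoted in the question. The helper name `avg_file_size_kb` is made up for this illustration. Both averages come out at well under 1 MB per file, far below the 128 MB that Spark's `spark.sql.files.maxPartitionBytes` defaults to.

```python
# Average file size per partition, using the figures from the question.
MB = 1024 * 1024
GB = 1024 * MB


def avg_file_size_kb(total_bytes: float, num_files: int) -> float:
    """Return the mean file size in KiB (hypothetical helper for illustration)."""
    return total_bytes / num_files / 1024


x_avg_kb = avg_file_size_kb(18 * MB, 191)     # table x: 18 MB in ~191 files
y_avg_kb = avg_file_size_kb(1.5 * GB, 3200)   # table y: 1.5 GB in ~3200 files

print(f"x: ~{x_avg_kb:.0f} KB per file")
print(f"y: ~{y_avg_kb:.0f} KB per file")
```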

My advice is to coalesce the Spark dataframes immediately after reading the parquet files. Coalesce the 'x' dataframe into a single partition and the 'y' dataframe into 6-7 partitions.

After that, perform the join as a broadcast hash join (BroadcastHashJoin).
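A minimal PySpark sketch of the steps above might look like the following. It is an illustration under assumptions, not a tested pipeline: the paths, the join column, and the function names `suggested_partitions` and `join_small_and_large` are all hypothetical, and the 256 MB per-partition target is an assumed value chosen only because it yields 6 partitions for 1.5 GB, in line with the 6-7 suggested.

```python
import math


def suggested_partitions(total_bytes: int,
                         target_bytes: int = 256 * 1024 * 1024) -> int:
    """Hypothetical helper: number of partitions so that each one holds
    roughly target_bytes (the 256 MB default is an assumption)."""
    return max(1, math.ceil(total_bytes / target_bytes))


def join_small_and_large(spark, x_path, y_path, join_key, y_partitions):
    """Read both tables, coalesce them right after the read, and join them
    with a broadcast hint on the small side. All arguments are placeholders."""
    # Imported here so the sizing helper above works without Spark installed.
    from pyspark.sql.functions import broadcast

    # Small table: collapse to a single partition.
    x = spark.read.parquet(x_path).coalesce(1)
    # Large table: collapse the many small input splits into a few partitions.
    y = spark.read.parquet(y_path).coalesce(y_partitions)

    # The broadcast hint asks Spark to ship x to every executor, so the
    # large table does not need to be shuffled for the join.
    return y.join(broadcast(x), on=join_key, how="inner")
```

With a SparkSession available, `join_small_and_large(spark, x_path, y_path, "id", suggested_partitions(int(1.5 * 1024**3)))` would return the joined dataframe, and calling `explain()` on it should show whether a BroadcastHashJoin was actually planned.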


The technical posts on this site follow the CC BY-SA 4.0 license. If you need to reprint, please indicate the site URL or the original address. For any questions, please contact: yoyou2525@163.com.

solution1 0 2019-07-05 13:08:46
