
Spark Performance Issue vs Hive

I am working on a pipeline that will run daily. It involves joining two tables, say x and y (approx. 18 MB and 1.5 GB in size respectively), and loading the output of the join into a final table.

Here are the facts about the environment:

For table x:

  • Data size: 18 MB
  • Number of files in a partition: ~191
  • File type: Parquet

For table y:

  • Data size: 1.5 GB
  • Number of files in a partition: ~3200
  • File type: Parquet

Now the problem is:

Hive and Spark are giving the same performance (the time taken is the same).

I tried different combinations of resources for the Spark job, for example:

  • executors: 50, memory: 20 GB, cores: 5
  • executors: 70, memory: 20 GB, cores: 5
  • executors: 1, memory: 20 GB, cores: 5

All three combinations give the same performance. I am not sure what I am missing here.
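For reference, here is a minimal sketch of how one of those resource combinations can be applied, assuming the SparkSession is built in the application itself (the app name and values are placeholders mirroring the first combination; on a cluster these are more commonly passed to spark-submit):

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: the standard Spark properties behind the combinations above,
// shown for the first combination (50 executors, 20 GB, 5 cores each).
val spark = SparkSession.builder()
  .appName("daily-x-y-join")                  // placeholder app name
  .config("spark.executor.instances", "50")   // number of executors
  .config("spark.executor.memory", "20g")     // memory per executor
  .config("spark.executor.cores", "5")        // cores per executor
  .getOrCreate()
```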

I also tried broadcasting the small table 'x' to avoid a shuffle during the join, but it did not improve performance much.
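For context, this is roughly what the broadcast attempt looked like; the paths and the join key "id" are assumptions for illustration, and `spark` is an existing SparkSession:

```scala
import org.apache.spark.sql.functions.broadcast

// Hint Spark to ship the small table 'x' to every executor so the join
// does not shuffle the 1.5 GB table 'y'. Paths and join key are placeholders.
val x = spark.read.parquet("/warehouse/x")   // ~18 MB
val y = spark.read.parquet("/warehouse/y")   // ~1.5 GB

val joined = y.join(broadcast(x), Seq("id"))
```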

One key observation is:

About 70% of the execution time is spent reading the big table 'y', and I guess this is due to the large number of files per partition.
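A quick way to check whether the small files are the culprit is to look at how many partitions (and hence read tasks) Spark creates for 'y'; the path below is a placeholder:

```scala
// If the partition count is close to the file count (~3200) for only
// 1.5 GB of data, the read is dominated by per-file overhead.
val y = spark.read.parquet("/warehouse/y")
println(s"Partitions for y: ${y.rdd.getNumPartitions}")
```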

I am not sure how Hive manages to give the same performance.

Kindly suggest.

I think the main issue is that there are too many small files. A lot of CPU time is consumed by the I/O itself, so you never get to see the processing power of Spark.

My advice is to coalesce the Spark DataFrames immediately after reading the Parquet files. Coalesce the 'x' DataFrame into a single partition and the 'y' DataFrame into 6-7 partitions.

After doing the above, perform the join (a broadcast hash join), as sketched below.
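A minimal sketch of that suggestion, assuming an existing SparkSession `spark`, a join key "id", and placeholder paths (all of these are assumptions, not taken from the question):

```scala
import org.apache.spark.sql.functions.broadcast

// Collapse the many small files into a few partitions right after reading,
// then broadcast the small side so the join avoids a shuffle.
val x = spark.read.parquet("/warehouse/x").coalesce(1)   // 18 MB  -> 1 partition
val y = spark.read.parquet("/warehouse/y").coalesce(7)   // 1.5 GB -> ~7 partitions

val joined = y.join(broadcast(x), Seq("id"))
joined.write.mode("overwrite").parquet("/warehouse/final_table")  // placeholder target
```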
