
Spark performance issue on Join of two Tables

I have two big Hive tables which I want to join with spark.sql. Let's say table 1 has 5 million rows and table 2 has 70 million rows. Both tables are Snappy-compressed and stored as Parquet files in Hive.

I want to join them and run some aggregations on a few columns, let's say count all rows and take the average of a column (e.g. doubleColumn), while filtering with two conditions (let's say on col1 and col2).

Note: I am working on our test installation on a single machine (which is quite powerful, though). I expect that performance would probably be different on a cluster.

My first try was to use Spark SQL like this:

 val stat = sqlContext.sql("SELECT count(t1.id), avg(t2.doubleColumn) " +
                              " FROM db.table1 AS t1 JOIN db.table2 AS t2 " +
                              " ON t1.id = t2.id " +
                              " WHERE t1.col1 = val1 AND t1.col2 = val2").collect

Unfortunately this runs very poorly, taking about 5 minutes, even when I give at least 8 GB of memory to each executor and to the driver. I also tried the DataFrame syntax, filtering the rows first and selecting only specific columns to get better selectivity:

//Filter first and select only the needed columns
val df = spark.sql("SELECT * FROM db.tab1")
val tab1 = df.filter($"col1" === "val1" && $"col2" === "val2").select("id")

val tab2 = spark.sql("SELECT id, doubleColumn FROM db.tab2")
val joined = tab1.as("d1").join(tab2.as("d2"), $"d1.id" === $"d2.id")

//Take the aggregations on the joined df
import org.apache.spark.sql.functions

joined.agg(
   functions.count($"d1.id").as("count"),
   functions.avg($"d2.doubleColumn").as("average")
).show()

But this gives no significant performance gain. How can I improve the join performance?

  • Which is the better way to do this, spark.sql or the DataFrame syntax?

  • Will giving more executors or memory help?

  • Should I use cache?
    I cached both dataframes tab1 and tab2, and the join aggregation gained significantly, but I don't think it is practical to cache my dataframes, since we are interested in concurrency, with many users simultaneously asking the same analytical query.

  • Is there nothing to do because I work on a single node, and will my problems go away when I move to a production environment on a cluster?

Bonus question: I tried this query with Impala and it took about 40 seconds, which was way better than spark.sql. How can Impala be better than Spark?!

Which is the better way to do this, spark.sql or the DataFrame syntax?

There is no difference whatsoever.

Will giving more executors or memory help?

Only if the problems are not caused by data skew and you adjust the configuration correctly.

Should I use cache?

If the input data is reused multiple times, then it might be advisable performance-wise (as you already determined).

Is there nothing to do because I work on a single node, and will my problems go away when I move to a production environment on a cluster?

In general, performance testing on a single node is completely useless. It misses both the bottlenecks (network I/O / communication) and the advantages (amortized disk I/O and resource usage).

However, you can significantly reduce parallelism (spark.sql.shuffle.partitions, spark.default.parallelism, and an increased input split size). Counterintuitively, Spark-style parallelism, which is designed for distributing load, is more of a liability on a single machine than an asset. It depends on shuffles (disk writes!) for communication, making things extremely slow compared to shared memory, and the scheduling overhead is significant.
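For example, on a single machine these values could be lowered when the session is created. This is a minimal sketch; the partition counts below are illustrative, not tuned recommendations:

import org.apache.spark.sql.SparkSession

// Fewer shuffle partitions means fewer tiny tasks, so less scheduling
// and shuffle overhead on a single box. The values are examples only.
val spark = SparkSession.builder()
  .appName("single-node-join")
  .master("local[*]")
  .config("spark.sql.shuffle.partitions", "8")   // default is 200
  .config("spark.default.parallelism", "8")
  .getOrCreate()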

How can Impala be better than Spark?!

Because it is specifically designed for low latency concurrent queries. It is not something that was ever a goal of Spark (database vs. ETL framework).

As you state:

as we are interested in concurrency, with many users simultaneously asking the same analytical query.

Spark just doesn't sound like the right choice.

You can change the configs, and you would have to change them on a large cluster anyway. I can think of two things right away. Set spark.executor.cores to 5 and, depending on the memory, give more executors and more memory with spark.executor.instances and spark.executor.memory. Also, can you bucket and sort the Hive tables on some column? If you bucket the tables, it removes the need to sort them before joining. See the sketch below.
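A rough sketch of the bucketing idea, assuming the tables can be rewritten; the bucket count of 64 and the *_bucketed table names are only illustrative:

// Writing both tables bucketed and sorted on the join key lets the
// sort-merge join skip the shuffle/sort step. Bucket counts must match
// on both sides; 64 is just an example value.
spark.sql("SELECT * FROM db.tab1")
  .write
  .bucketBy(64, "id")
  .sortBy("id")
  .format("parquet")
  .saveAsTable("db.tab1_bucketed")

spark.sql("SELECT * FROM db.tab2")
  .write
  .bucketBy(64, "id")
  .sortBy("id")
  .format("parquet")
  .saveAsTable("db.tab2_bucketed")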

It might also be faster if you cached the DataFrame after the join, depending on how Catalyst handles the aggregation query. You can also unpersist() after the query is over, but I agree the GC might not make it worth it.
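A minimal sketch of that pattern, reusing joined and functions from the question's code:

// Cache the joined result so repeated aggregations over it reuse the
// materialized data instead of re-running the join.
joined.cache()

joined.agg(
  functions.count($"d1.id").as("count"),
  functions.avg($"d2.doubleColumn").as("average")
).show()

// Release the cached blocks once no further queries need them.
joined.unpersist()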

You won't see any benefit from using SQL over the Scala DSL. Both use whole-stage code generation, so they are essentially the same.

One reason Impala is always faster is that it never worries about replication. With one node that shouldn't matter as much, but there might not be a graceful separation in Spark between preparing the data for replication and not needing to replicate.
