Spark (2.2) Performance <-> Spark Persist

I have a whitelist (wl) of users and items, from which I would like to subtract the users and items that are blacklisted (bl). This is done using a left anti join. Both resulting lists are then combined using a crossJoin.

The issue is that doing this takes forever, even for an absolutely minimal case (I eventually get an out-of-memory exception, even on an entire Spark cluster); see the attached code. However, when I do the same thing using persist(), the same minimal case takes only a few seconds to run.

Specifically:

from pyspark.sql import DataFrame, SparkSession
spark: SparkSession = SparkSession.builder.appName("dummy").getOrCreate()

# preparing dummy data
bl_i_data = [(20,), (30,), (60,)]
bl_i = spark.createDataFrame(bl_i_data, ["i_id"])
bl_u_data = [(1,), (3,), (6,)]
bl_u = spark.createDataFrame(bl_u_data, ["u_id"])
wl_u_data = [(1,), (2,), (3,), (4,), (5,)]
wl_u = spark.createDataFrame(wl_u_data, ["u_id"])
wl_i_data = [(20,), (30,), (40,), (50,), (60,)]
wl_i = spark.createDataFrame(wl_i_data, ["i_id"])

# combining wls and bls
l_u = wl_u.join(bl_u, on="u_id", how="left_anti")
l_i = wl_i.join(bl_i, on="i_id", how="left_anti")

# Takes forever to run:
u_i = l_u.crossJoin(l_i)
u_i.count()

# works fine if users and items get persisted first:
# l_u.persist()
# l_u.count()
# l_i.persist()
# l_i.count()
# u_i = l_u.crossJoin(l_i)
# u_i.count()

Does anyone have a good explanation as to what exactly is happening, and/or has anyone seen this behaviour before? I would like to avoid using persist(), as I don't want to do the memory management myself.

You can look at Spark's execution plan by calling explain(). Add it to your code as follows.

u_i = l_u.crossJoin(l_i)
u_i.explain()  # explain() prints the physical plan directly and returns None
u_i.count()

The following are the explain plans without and with persist. A join in Spark leads to a lot of data shuffling between executors, which can cause performance degradation. Spark tries to optimize this shuffle away by broadcasting the right-side DataFrame if its estimated size is below a default threshold (spark.sql.autoBroadcastJoinThreshold, 10 MB by default). The broadcast avoids the shuffle because all of the data is already available on each executor.

When you persist and count, the DataFrame is precomputed, so Spark knows the size of the right-side data and is able to broadcast it, avoiding the shuffle. Without persist, the DataFrame is computed on the fly and shuffled to the executors, which causes the delay. If you want the broadcast without managing persistence yourself, you can also hint it explicitly; a sketch follows after the plans below.

without persist:

== Physical Plan ==
CartesianProduct
:- SortMergeJoin [u_id#1092L], [u_id#1090L], LeftAnti
:  :- *(1) Sort [u_id#1092L ASC NULLS FIRST], false, 0
:  :  +- Exchange hashpartitioning(u_id#1092L, 200)
:  :     +- Scan ExistingRDD[u_id#1092L]
:  +- *(2) Sort [u_id#1090L ASC NULLS FIRST], false, 0
:     +- Exchange hashpartitioning(u_id#1090L, 200)
:        +- Scan ExistingRDD[u_id#1090L]
+- SortMergeJoin [i_id#1094L], [i_id#1088L], LeftAnti
   :- *(3) Sort [i_id#1094L ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(i_id#1094L, 200)
   :     +- Scan ExistingRDD[i_id#1094L]
   +- *(4) Sort [i_id#1088L ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(i_id#1088L, 200)
         +- Scan ExistingRDD[i_id#1088L]   

with persist:

== Physical Plan ==
BroadcastNestedLoopJoin BuildRight, Cross
:- *(1) InMemoryTableScan [u_id#1002L]
:     +- InMemoryRelation [u_id#1002L], true, 10000, StorageLevel(disk, memory, 1 replicas)
:           +- SortMergeJoin [u_id#1002L], [u_id#1000L], LeftAnti
:              :- *(1) Sort [u_id#1002L ASC NULLS FIRST], false, 0
:              :  +- Exchange hashpartitioning(u_id#1002L, 200)
:              :     +- Scan ExistingRDD[u_id#1002L]
:              +- *(2) Sort [u_id#1000L ASC NULLS FIRST], false, 0
:                 +- Exchange hashpartitioning(u_id#1000L, 200)
:                    +- Scan ExistingRDD[u_id#1000L]
+- BroadcastExchange IdentityBroadcastMode
   +- *(2) InMemoryTableScan [i_id#1004L]
         +- InMemoryRelation [i_id#1004L], true, 10000, StorageLevel(disk, memory, 1 replicas)
               +- SortMergeJoin [i_id#1004L], [i_id#998L], LeftAnti
                  :- *(1) Sort [i_id#1004L ASC NULLS FIRST], false, 0
                  :  +- Exchange hashpartitioning(i_id#1004L, 200)
                  :     +- Scan ExistingRDD[i_id#1004L]
                  +- *(2) Sort [i_id#998L ASC NULLS FIRST], false, 0
                     +- Exchange hashpartitioning(i_id#998L, 200)
                        +- Scan ExistingRDD[i_id#998L]
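
As mentioned above, one way to get the broadcast without calling persist() yourself is to hint it explicitly. The sketch below is not part of the original answer: it reuses the DataFrames from the question together with the standard pyspark.sql.functions.broadcast hint, and reads the spark.sql.autoBroadcastJoinThreshold config key that controls automatic broadcasting.

from pyspark.sql.functions import broadcast

# Size threshold (in bytes) below which Spark broadcasts a relation
# automatically; the default is 10 MB.
print(spark.conf.get("spark.sql.autoBroadcastJoinThreshold"))

# Explicitly mark the (small) right side for broadcast, so the cross join
# should be planned as a BroadcastNestedLoopJoin rather than a
# CartesianProduct, without having to persist() and count() first.
u_i = l_u.crossJoin(broadcast(l_i))
u_i.explain()
u_i.count()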
