
OR condition in dataframe full outer join reducing performance spark/scala

I'm seeing that if my join condition involves an OR clause, it takes much longer than an AND clause. I'm doing a full outer join.

My unit test doesn't show a major difference, but when run against a large dataset it is very slow.

df1.join(df2, expr("id1 = id2 AND amount1 = amount2"), "fullouter").cache()

The following runs for a long time, roughly 10 times slower:

df1.join(df2, expr("id1 = id2 OR amount1 = amount2"), "fullouter").cache()

Any thoughts?

TL;DR The performance difference between these two is expected.

  • Logical conjunction ( AND ) can be expressed as a shuffle-based operation (either a sort-merge join or a shuffle hash join). This means each record has to be transferred to only one child partition, based on the hash of all expressions included in the join condition.

    While in the worst case scenario (all join keys being constant and matching between both datasets) it can take N*M comparisons with all records shuffled to a single partition, typical scenarios, with real-life data, should be much more efficient, making roughly N + M comparisons.
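To make the first bullet concrete, here is a minimal sketch in plain Scala (not Spark's actual implementation; the `Row` class, partition count, and sample data are made up for illustration). Because the hash covers all join keys, rows that satisfy the AND condition are guaranteed to land in the same partition, so comparisons only happen within partitions:

```scala
// Hypothetical row type and data -- not from the question.
case class Row(id: Int, amount: Int)

val numPartitions = 4
// Hash on ALL join keys, as a shuffle-based equi-join would.
def partitionOf(r: Row): Int =
  math.floorMod((r.id, r.amount).hashCode, numPartitions)

val left  = Seq(Row(1, 100), Row(2, 200), Row(3, 300))
val right = Seq(Row(1, 100), Row(2, 999), Row(4, 400))

// "Shuffle": group each side by its target partition.
val leftParts  = left.groupBy(partitionOf)
val rightParts = right.groupBy(partitionOf)

// Compare rows only within the same partition -- roughly N + M work
// when keys spread evenly, never a full cross-comparison.
val matches = for {
  p <- 0 until numPartitions
  l <- leftParts.getOrElse(p, Nil)
  r <- rightParts.getOrElse(p, Nil)
  if l.id == r.id && l.amount == r.amount
} yield (l, r)
// Only Row(1, 100) matches on both keys.
```

The guarantee that makes this work is exactly the conjunction: a pair can only match if it agrees on every key, and agreeing on every key implies equal hashes, hence the same partition.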

  • Logical disjunction ( OR ) cannot be expressed as a simple shuffle-based operation, because records hashing to different buckets can still match. Therefore Spark uses a Cartesian product followed by a selection, always transferring each record multiple times and making N*M comparisons.
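The second bullet can be illustrated the same way (again a plain-Scala sketch with made-up data, not Spark internals). A pair that matches on amount alone may hash to different buckets when the hash covers both keys, so no partitioning scheme can co-locate all matching pairs; every (left, right) pair has to be examined:

```scala
// Hypothetical row type and data -- not from the question.
case class Row(id: Int, amount: Int)

val lhs = Seq(Row(1, 100), Row(2, 200))
val rhs = Seq(Row(9, 100), Row(2, 999))

// Cartesian product followed by selection: N * M comparisons.
var comparisons = 0
val matches = for {
  l <- lhs
  r <- rhs
  _ = { comparisons += 1 }
  if l.id == r.id || l.amount == r.amount
} yield (l, r)
// All 4 pairs are examined; the amount-only match (100) and the
// id-only match (2) are both found, which no single-key hash could
// have co-located in one partition.
```

This is why the OR version scales as N*M regardless of how well the keys are distributed, while the AND version degrades to N*M only in pathological cases (constant, always-matching keys).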

