
Apache Spark, range-joins, data skew and performance

I have the following Apache Spark SQL join predicate:

t1.field1 = t2.field1 and t2.start_date <= t1.event_date and t1.event_date < t2.end_date

data:

The t1 DataFrame has over 50 million rows.
The t2 DataFrame has over 2 million rows.

Almost all t1.field1 values in the t1 DataFrame are the same (null).

Right now the Spark cluster hangs for more than 10 minutes on a single task while performing this join, because of the data skew. Only one worker, and a single task on that worker, is doing any work at that point; the other 9 workers are idle. How can I improve this join so that the load of this one task is distributed across the whole Spark cluster?

I am assuming you are doing an inner join.

The following steps can be followed to optimise the join (a sketch combining several of them follows the list):

  1. Before joining, filter t1 and t2 based on the smallest or largest start_date, event_date and end_date. This will reduce the number of rows.

  2. Check whether the t2 dataset has null values for field1; if it does not, t1 can be filtered on a notNull condition before the join. This will reduce the size of t1.

  3. If your job gets only a few of the available executors, you have too few partitions. Simply repartition the dataset, choosing a number that is neither too large nor too small.

  4. You can check whether the partitioning worked properly (no skew) by looking at the task execution times; they should be similar.

  5. If the smaller dataset fits in executor memory, a broadcast join can be used.
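A minimal sketch of points 2, 3 and 5 in Spark/Scala (column and DataFrame names are taken from the question; the partition count of 200 and the decision to broadcast t2 are illustrative assumptions, not measured values):

    import org.apache.spark.sql.functions.{broadcast, col}

    // Point 2: drop t1 rows that can never match an inner equi-join on field1
    val t1NotNull = t1.filter(col("field1").isNotNull)

    // Point 3: repartition so the remaining work is spread across the cluster
    // (200 is only a placeholder; tune it to your cluster and data volume)
    val t1Part = t1NotNull.repartition(200, col("field1"))

    // Point 5: t2 (~2 million rows) may be small enough to broadcast to every executor
    val joined = t1Part.join(
      broadcast(t2),
      t1Part("field1") === t2("field1") &&
        t2("start_date") <= t1Part("event_date") &&
        t1Part("event_date") < t2("end_date")
    )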

You might like to read: https://github.com/vaquarkhan/Apache-Kafka-poc-and-notes/wiki/Apache-Spark-Join-guidelines-and-Performance-tuning

If almost all the rows in t1 have t1.field1 = null, and the event_date column is numeric (or you convert it to a timestamp), you can first use Apache DataFu to do a ranged join, and filter out the rows in which t1.field1 != t2.field1 afterwards.

The range join code would look like this:

t1.joinWithRange("event_date", t2, "start_date", "end_date", 10)

The last argument, 10, is the decrease factor. This does bucketing, as Raphael Roth suggested in his answer.
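A minimal end-to-end sketch of that approach (assuming the DataFrameOps implicit from datafu-spark and a numeric or timestamp event_date; the rename of t2.field1 is only there to make the post-join filter unambiguous, and the exact bound inclusivity of joinWithRange should be checked against the datafu-spark documentation):

    import datafu.spark.DataFrameOps._
    import org.apache.spark.sql.functions.col

    // Rename t2.field1 up front so the filter after the join is unambiguous
    val t2Renamed = t2.withColumnRenamed("field1", "t2_field1")

    // Range join of t1.event_date against [start_date, end_date),
    // bucketed with a decrease factor of 10
    val ranged = t1.joinWithRange("event_date", t2Renamed, "start_date", "end_date", 10)

    // Apply the equality predicate afterwards, as described above
    // (rows where either side is null are dropped, matching the original inner equi-join)
    val result = ranged.where(col("field1") === col("t2_field1"))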

You can see an example of such a ranged join in the blog post introducing DataFu-Spark.

Full disclosure: I am a member of DataFu and wrote the blog post.

I assume Spark already pushes a not-null filter on t1.field1; you can verify this in the explain plan.
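For example, a quick check using the predicate from the question (if Spark inferred the filter, an isnotnull(field1) condition appears in the optimized and physical plans):

    val joined = t1.join(
      t2,
      t1("field1") === t2("field1") &&
        t2("start_date") <= t1("event_date") &&
        t1("event_date") < t2("end_date")
    )

    // Look for Filter ... isnotnull(field1#...) in the printed plans
    joined.explain(true)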

I would rather experiment with creating an additional attribute that can be used as an equi-join condition, e.g. by bucketing. For example, you could create a month attribute. To do this, you would need to enumerate the months in t2, which is usually done with a UDF. See this SO question for an example: How to improve broadcast Join speed with between condition in Spark
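A sketch of the month-bucket idea (assuming start_date, end_date and event_date are date columns; on Spark 2.4+ the built-in sequence/explode functions can stand in for the UDF mentioned above, and the month column name is only illustrative):

    import org.apache.spark.sql.functions._

    // For t2, enumerate every month the [start_date, end_date) range touches
    val t2WithMonth = t2.withColumn(
      "month",
      explode(sequence(trunc(col("start_date"), "month"),
                       trunc(col("end_date"), "month"),
                       expr("interval 1 month"))))

    // For t1, the bucket is simply the month of the event
    val t1WithMonth = t1.withColumn("month", trunc(col("event_date"), "month"))

    // (field1, month) is now an equi-join key Spark can hash on;
    // the original range predicate is kept as a residual condition,
    // so a possible extra trailing month bucket on t2 is harmless
    val joined = t1WithMonth.join(
      t2WithMonth,
      t1WithMonth("field1") === t2WithMonth("field1") &&
        t1WithMonth("month") === t2WithMonth("month") &&
        t2WithMonth("start_date") <= t1WithMonth("event_date") &&
        t1WithMonth("event_date") < t2WithMonth("end_date")
    )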
