
Spark join/groupBy on datasets takes a lot of time

I have 2 datasets (tables) with 35kk+ rows.

I try to join (or group by) these datasets by some id (in general it will be one-to-one).

But this operation takes a lot of time: 25+ hours.

Filters alone work fine: ~20 mins.

Environment: emr-5.3.1

Hadoop distribution: Amazon

Applications: Ganglia 3.7.2, Spark 2.1.0, Zeppelin 0.6.2

Instance type: m3.xlarge

Code (groupBy):

Dataset<Row> dataset = ...
...
.groupBy("id")
.agg(functions.min("date"))
.withColumnRenamed("min(date)", "minDate")

Code (join):

...
.join(dataset2, dataset.col("id").equalTo(dataset2.col("id")))

Also I found this message in the EMR logs:

HashAggregateExec: spark.sql.codegen.aggregate.map.twolevel.enable is set to true, but current version of codegened fast hashmap does not support this aggregate.

There might be a possibility of the data getting skewed. We faced this. Check your joining column. This happens mostly when your joining column has NULLs.

Check the data distribution with:

select joining_col, count(joining_col)
from <tablename>
group by joining_col

This will give you an idea of whether the data in your joining column is evenly distributed.
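
The same check, plus one common mitigation (dropping rows with a null join key before the join), can be done directly with the Dataset API. A minimal sketch, assuming the `dataset`, `dataset2`, and `id` column from the question:

    import static org.apache.spark.sql.functions.*;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    // Count rows per id; a handful of very large counts (or a huge null group)
    // indicates skew that concentrates the shuffle work on a few tasks.
    dataset.groupBy("id")
           .count()
           .orderBy(desc("count"))
           .show(20);

    // Rows with a null id all fall into the same group/partition in a groupBy,
    // so filtering them out up front (when that is acceptable for your data)
    // rules out that source of skew before joining.
    Dataset<Row> left  = dataset.filter(col("id").isNotNull());
    Dataset<Row> right = dataset2.filter(col("id").isNotNull());

    Dataset<Row> joined = left.join(right, left.col("id").equalTo(right.col("id")));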
