
Spark join/groupBy on datasets takes a lot of time

I have 2 datasets (tables) with 35kk+ rows.

I try to join (or group by) these datasets by some id (in general it will be one-to-one).

But this operation takes a lot of time: 25+ hours.

Filters alone work fine: ~20 mins.

Environment: emr-5.3.1

Hadoop distribution: Amazon

Applications: Ganglia 3.7.2, Spark 2.1.0, Zeppelin 0.6.2

Instance type: m3.xlarge

Code (groupBy):

Dataset<Row> dataset = ...
...
.groupBy("id")
.agg(functions.min("date"))
.withColumnRenamed("min(date)", "minDate")

Code (join):

...
.join(dataset2, dataset.col("id").equalTo(dataset2.col("id")))

Also I found this message in the EMR logs:

HashAggregateExec: spark.sql.codegen.aggregate.map.twolevel.enable is set to true, but current version of codegened fast hashmap does not support this aggregate.

There might be a possibility of the data getting skewed. We faced this. Check your joining column. This happens mostly when your joining column has NULLs.

Check the data distribution with:

select joining_col, count(joining_col)
from <tablename>
group by joining_col

This will give you an idea of whether the data in your joining column is evenly distributed.
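
The same check, plus one common mitigation (dropping rows with a null join key before the join), can be done directly with the Dataset API. A minimal sketch, assuming the `dataset`, `dataset2`, and `id` column from the question:

    import static org.apache.spark.sql.functions.*;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    // Count rows per id; a handful of very large counts (or a huge null group)
    // indicates skew that concentrates the shuffle work on a few tasks.
    dataset.groupBy("id")
           .count()
           .orderBy(desc("count"))
           .show(20);

    // Rows with a null id all fall into the same group/partition in a groupBy,
    // so filtering them out up front (when that is acceptable for your data)
    // rules out that source of skew before joining.
    Dataset<Row> left  = dataset.filter(col("id").isNotNull());
    Dataset<Row> right = dataset2.filter(col("id").isNotNull());

    Dataset<Row> joined = left.join(right, left.col("id").equalTo(right.col("id")));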
