
Spark join/groupBy on datasets takes a long time

I have two datasets (tables) with 35M+ rows.

I am trying to join (or group by) these datasets on some id column. (In general it will be one-to-one.)

But this operation takes a long time: 25+ hours.

Filters alone work fine: ~20 minutes.

Env: emr-5.3.1

Hadoop distribution: Amazon

Applications: Ganglia 3.7.2, Spark 2.1.0, Zeppelin 0.6.2

Instance type: m3.xlarge

Code (groupBy):

Dataset<Row> dataset = ...
...
.groupBy("id")
.agg(functions.min("date"))
.withColumnRenamed("min(date)", "minDate")
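The `withColumnRenamed` step can be folded into the aggregation by aliasing the result column directly; a minimal sketch, with the source `Dataset<Row>` elided as in the original:

```java
// Sketch: same aggregation, naming the result column via alias()
// instead of renaming "min(date)" afterwards.
Dataset<Row> minDates = dataset
    .groupBy("id")
    .agg(functions.min("date").alias("minDate"));
```

This changes only the plan's column naming, not its performance; the shuffle for the groupBy is the same either way.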

Code (join):

...
.join(dataset2, dataset.col("id").equalTo(dataset2.col("id")))
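If the `id` column can contain NULLs, every NULL-keyed row hashes to the same shuffle partition, which can stall a single task for hours. A hedged sketch that filters them out before the join (NULL keys never match in an equi-join anyway); `dataset` and `dataset2` are the datasets from the question:

```java
// Sketch: drop NULL join keys before the shuffle to avoid piling
// all NULL rows onto one partition.
Dataset<Row> left = dataset.filter(functions.col("id").isNotNull());
Dataset<Row> right = dataset2.filter(functions.col("id").isNotNull());

Dataset<Row> joined = left.join(right, left.col("id").equalTo(right.col("id")));
```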

Also, I found this message in the EMR logs:

HashAggregateExec: spark.sql.codegen.aggregate.map.twolevel.enable is set to true, but current version of codegened fast hashmap does not support this aggregate.
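This message is usually informational: Spark falls back from the two-level codegen hashmap to the regular aggregation path for aggregates it does not support, so by itself it is unlikely to explain a 25-hour run. If desired, the flag named in the message can be disabled (config fragment for spark-defaults, assuming Spark 2.1 behavior):

```
spark.sql.codegen.aggregate.map.twolevel.enable  false
```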

There might be data skew; we faced this ourselves. Check your joining column: this happens mostly when the joining column contains NULLs.

Check the data distribution with:

select joining_col, count(joining_col) from <tablename>
group by joining_col
order by 2 desc

This will give you an idea of whether the data in your joining column is evenly distributed.
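The same distribution check can be written with the Dataset API instead of SQL; a sketch reusing the `dataset` from the question:

```java
// Sketch: count rows per join key and show the heaviest keys first.
// A few keys with huge counts (or many NULLs) indicate skew.
dataset.groupBy("id")
       .count()
       .orderBy(functions.col("count").desc())
       .show(20);
```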
