
Drop duplicates for each partition

origin data

cls, id  
----
a, 1
a, 1
----
b, 3
b, 3
b, 4

expected output

cls, id  
----
a, 1
----
b, 3
b, 4

id can be duplicated only within the same cls; that is, the same id never appears across different clses.

In that case,

df.dropDuplicates("id")

will shuffle across all partitions to check for duplicates over every cls, and the result is repartitioned to 200 partitions (the default value of spark.sql.shuffle.partitions).

Now, how can I run dropDuplicates on each partition separately to reduce the computing cost?

something like

df.foreachPartition(_.dropDuplicates())

You're probably after something like this:

import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.encoders.RowEncoder

val distinct = df.mapPartitions { it =>
  // Track rows already seen in this partition; Set#add returns
  // false when the element is already present, so duplicates are dropped.
  val seen = scala.collection.mutable.Set[Row]()
  it.filter(seen.add)
}(RowEncoder(df.schema)) // explicit Row encoder (Spark 3.x)
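Note that this only removes duplicates that happen to land in the same partition. If your data is not already partitioned by cls, you can repartition on it first so that all rows of a class are co-located. A minimal end-to-end sketch, assuming a local SparkSession named spark and the two-column schema from the question:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.encoders.RowEncoder

val spark = SparkSession.builder().master("local[*]").appName("dedup").getOrCreate()
import spark.implicits._

val df = Seq(("a", 1), ("a", 1), ("b", 3), ("b", 3), ("b", 4)).toDF("cls", "id")

// One shuffle on cls co-locates all rows of a class; each partition
// is then deduplicated locally, with no second, global shuffle.
val deduped = df.repartition($"cls").mapPartitions { it =>
  val seen = scala.collection.mutable.Set[(String, Int)]()
  it.filter(row => seen.add((row.getString(0), row.getInt(1))))
}(RowEncoder(df.schema))
```

The per-partition Set only ever holds the distinct rows of the classes in that partition, not the whole dataset.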

Not with a Set. In fact, a Set is too dangerous if the data is huge, since it must hold every distinct row of the partition in memory. One option is to use mapPartitionsWithIndex and emit the partition index with each row, so the partition index becomes a column of your DF. Then apply dropDuplicates on the combination of the partition index and the key: for each (partition index, key) pair, the duplicate records get removed.
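That idea can be sketched with the built-in spark_partition_id() function, which plays the same role as the index from mapPartitionsWithIndex (the column name pid is arbitrary):

```scala
import org.apache.spark.sql.functions.spark_partition_id

// Tag each row with the index of the partition it currently lives in,
// then dedup on (partition index, id): rows in different partitions are
// never merged, so only within-partition duplicates are removed.
val deduped = df
  .withColumn("pid", spark_partition_id())
  .dropDuplicates("pid", "id")
  .drop("pid")
```

This trades the in-memory Set for Spark's own aggregation machinery, which can spill to disk when a partition's distinct keys do not fit in memory.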
