
Drop duplicates for each partition

origin data

cls, id  
----
a, 1
a, 1
----
b, 3
b, 3
b, 4

expected output

cls, id  
----
a, 1
----
b, 3
b, 4

id can be duplicated only within the same cls; that is, the same id never appears across different clses.

In that case,

df.dropDuplicates("id")

will shuffle across all partitions to check for duplicates over every cls, and the result is repartitioned to 200 partitions (the default value of spark.sql.shuffle.partitions).

Now, how can I run dropDuplicates on each partition separately to reduce the computing cost?

something like

df.foreachPartition(_.dropDuplicates())

You're probably after something like this:

import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.encoders.RowEncoder

val distinct = df.mapPartitions { it =>
  // Track rows already seen in this partition; Set#add returns
  // false when the element is already present, so duplicates are dropped.
  val seen = scala.collection.mutable.Set[Row]()
  it.filter(seen.add)
}(RowEncoder(df.schema)) // explicit Row encoder (Spark 3.x)
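Note that this only removes duplicates that happen to land in the same partition. If your data is not already partitioned by cls, you can repartition on it first so that all rows of a class are co-located. A minimal end-to-end sketch, assuming a local SparkSession named spark and the two-column schema from the question:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.encoders.RowEncoder

val spark = SparkSession.builder().master("local[*]").appName("dedup").getOrCreate()
import spark.implicits._

val df = Seq(("a", 1), ("a", 1), ("b", 3), ("b", 3), ("b", 4)).toDF("cls", "id")

// One shuffle on cls co-locates all rows of a class; each partition
// is then deduplicated locally, with no second, global shuffle.
val deduped = df.repartition($"cls").mapPartitions { it =>
  val seen = scala.collection.mutable.Set[(String, Int)]()
  it.filter(row => seen.add((row.getString(0), row.getInt(1))))
}(RowEncoder(df.schema))
```

The per-partition Set only ever holds the distinct rows of the classes in that partition, not the whole dataset.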

Not with a Set. In fact, a Set is too dangerous if the data is huge, since it must hold every distinct row of the partition in memory. One option is to use mapPartitionsWithIndex and emit the partition index with each row, so the partition index becomes a column of your DF. Then apply dropDuplicates on the combination of the partition index and the key: for each (partition index, key) pair, the duplicate records get removed.
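That idea can be sketched with the built-in spark_partition_id() function, which plays the same role as the index from mapPartitionsWithIndex (the column name pid is arbitrary):

```scala
import org.apache.spark.sql.functions.spark_partition_id

// Tag each row with the index of the partition it currently lives in,
// then dedup on (partition index, id): rows in different partitions are
// never merged, so only within-partition duplicates are removed.
val deduped = df
  .withColumn("pid", spark_partition_id())
  .dropDuplicates("pid", "id")
  .drop("pid")
```

This trades the in-memory Set for Spark's own aggregation machinery, which can spill to disk when a partition's distinct keys do not fit in memory.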
