drop_duplicates after unionByName

I am trying to stack two DataFrames with unionByName() and then drop duplicate entries with drop_duplicates(). Can I trust unionByName() to preserve the order of the rows, i.e. will df1.unionByName(df2) always produce a DataFrame whose first N rows are those of df1? If so, applying drop_duplicates() would always keep df1's row, which is the behaviour I want.

unionByName() does not guarantee that the records from df1 come first and the records from df2 second. The work is distributed and executed in parallel, so you cannot rely on row order. In addition, drop_duplicates() makes no promise about which of the duplicate rows it keeps.

One solution is to add a technical priority column to each DataFrame, call unionByName(), and then use the row_number() window function, partitioned by the key (ID in the example) and ordered by priority, to keep only the highest-priority row for each key. In the code below, priority 1 outranks priority 2, so a row from df1 wins whenever the same ID appears in both DataFrames.

Take a look at the Scala code below:

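import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lit, row_number}

// Tag each source so that rows from df1 outrank rows from df2
val df1WithPriority = df1.withColumn("priority", lit(1))
val df2WithPriority = df2.withColumn("priority", lit(2))

// Keep only the highest-priority row per ID, then remove the helper columns
val deduplicated = df1WithPriority
  .unionByName(df2WithPriority)
  .withColumn(
    "row_num",
    row_number().over(Window.partitionBy("ID").orderBy(col("priority").asc))
  )
  .where(col("row_num") === lit(1))
  .drop("priority", "row_num")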
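
For anyone who wants to try the idea end to end, here is a minimal self-contained sketch. It assumes a local SparkSession and two small made-up DataFrames that share the key column ID; the object name PreferFirstExample, the helper preferFirst, and the sample values are hypothetical and only serve the illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lit, row_number}

object PreferFirstExample {

  // Hypothetical helper: union two DataFrames by column name and, for each
  // value of `key`, keep the row from `first` when both contain that key.
  def preferFirst(first: DataFrame, second: DataFrame, key: String): DataFrame = {
    val byPriority = Window.partitionBy(key).orderBy(col("priority").asc)

    first.withColumn("priority", lit(1))
      .unionByName(second.withColumn("priority", lit(2)))
      .withColumn("row_num", row_number().over(byPriority))
      .where(col("row_num") === 1)
      .drop("priority", "row_num") // remove the helper columns again
  }

  def main(args: Array[String]): Unit = {
    // Local session, for illustration only
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("prefer-first-example")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data: ID 2 is present in both DataFrames
    val df1 = Seq((1, "a-from-df1"), (2, "b-from-df1")).toDF("ID", "value")
    val df2 = Seq((2, "b-from-df2"), (3, "c-from-df2")).toDF("ID", "value")

    // Sort only for display; the deduplicated result itself has no defined order
    preferFirst(df1, df2, "ID").orderBy("ID").show()

    spark.stop()
  }
}
```

With this sample data, ID 2 should keep the value that came from df1, while IDs 1 and 3 pass through unchanged. Note that the priority column only breaks ties between the two sources: if df1 itself contains several rows with the same ID, row_number() still picks one of them arbitrarily, so add a further column to the orderBy if that case matters.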
