drop_duplicates after unionByName

Question

I am trying to stack two dataframes with unionByName() and then drop duplicate entries with drop_duplicates(). Can I trust that unionByName() will preserve the order of the rows, i.e., that df1.unionByName(df2) will always produce a dataframe whose first N rows are df1's? If so, drop_duplicates() would always keep df1's row when a duplicate exists, which is the behaviour I want.

Answer 1

unionByName() does not guarantee that the rows from df1 come first, followed by the rows from df2. Spark runs these operations as distributed, parallel tasks, so you can't rely on row order. drop_duplicates() does not promise which of the duplicate rows it keeps either.

One solution is to add a technical priority column to each DataFrame, call unionByName(), and then use the row_number() window function to number the rows within each ID by priority and keep only the row with the highest priority (in the code below, 1 ranks higher than 2).

Take a look at the Scala code below:

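import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lit, row_number}

// Tag each source: 1 = preferred (df1), 2 = fallback (df2)
val df1WithPriority = df1.withColumn("priority", lit(1))
val df2WithPriority = df2.withColumn("priority", lit(2))

val deduplicated = df1WithPriority
  .unionByName(df2WithPriority)
  .withColumn(
    "row_num",
    // Number the rows within each ID, lowest priority value first
    row_number().over(Window.partitionBy("ID").orderBy(col("priority").asc))
  )
  .where(col("row_num") === lit(1)) // keep the preferred row per ID
  .drop("row_num", "priority")      // remove the helper columns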
