I am trying to stack two dataframes with unionByName() and then drop duplicate entries with drop_duplicates(). Can I trust that unionByName() preserves the order of the rows, i.e. that df1.unionByName(df2) always produces a dataframe whose first N rows are df1's? If so, drop_duplicates() would always keep df1's row, which is the behaviour I want.
unionByName() does not guarantee that the records from df1 come first, followed by those from df2. Spark runs these operations as distributed, parallel tasks, so you cannot rely on row order.
One solution is to add a technical priority column to each DataFrame, call unionByName(), and then use the row_number() window function, ordered by priority within each ID, to keep only the row with the highest priority (in the example below, 1 ranks higher than 2).
Take a look at the Scala code below:
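import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lit, row_number}

// Tag each source so that rows from df1 win over rows from df2
val df1WithPriority = df1.withColumn("priority", lit(1))
val df2WithPriority = df2.withColumn("priority", lit(2))

df1WithPriority
  .unionByName(df2WithPriority)
  .withColumn(
    "row_num",
    row_number().over(Window.partitionBy("ID").orderBy(col("priority").asc))
  )
  // Keep only the highest-priority row for each ID
  .where(col("row_num") === lit(1))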
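To try the approach end to end, here is a minimal sketch that wraps the same steps in a helper and runs it on sample data. The object name PreferFirstUnion, the helper unionPreferFirst, the sample rows, and the "source" column are all hypothetical and only for illustration. It assumes duplicates are identified by key columns such as ID, and it drops the two helper columns at the end so the result has the original schema.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lit, row_number}

object PreferFirstUnion {
  // Hypothetical helper: union two DataFrames by column name and, for each key,
  // keep the row from `preferred` whenever both inputs contain that key.
  // Assumes neither input already has columns named "priority" or "row_num".
  def unionPreferFirst(preferred: DataFrame, other: DataFrame, keyCols: Seq[String]): DataFrame = {
    val byKey = Window
      .partitionBy(keyCols.map(c => col(c)): _*)
      .orderBy(col("priority").asc)

    preferred.withColumn("priority", lit(1))
      .unionByName(other.withColumn("priority", lit(2)))
      .withColumn("row_num", row_number().over(byKey))
      .where(col("row_num") === 1)
      .drop("priority", "row_num") // remove the technical columns again
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("prefer-first-union")
      .getOrCreate()
    import spark.implicits._

    // Sample data: ID 2 appears in both inputs
    val df1 = Seq((1, "from df1"), (2, "from df1")).toDF("ID", "source")
    val df2 = Seq((2, "from df2"), (3, "from df2")).toDF("ID", "source")

    unionPreferFirst(df1, df2, Seq("ID")).orderBy("ID").show()
    spark.stop()
  }
}
```

With this sample data, ID 2 is present in both inputs and the row that survives is the one from df1. If the same key occurs more than once within a single input, the rows tie on priority and row_number() picks one of them arbitrarily, so add a further ordering column to the window if that case matters.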