PySpark: Moving rows from one dataframe into another if column values are not found in second dataframe
I have two Spark dataframes with similar schemas:

DF1:
id      category  flag
123abc  type 1    1
456def  type 1    1
789ghi  type 2    0
101jkl  type 3    0
DF2:
id      category  flag
123abc  type 1    1
456def  type 1    1
789ghi  type 2    1
101xyz  type 3    0
DF1 has more data than DF2, so I cannot replace it. However, DF2 will have ids not found in DF1, as well as several ids with more accurate flag data. This means there are two situations that I need resolved:
789ghi has a different flag and needs to overwrite the 789ghi in DF1.
101xyz is not found in DF1 and needs to be moved over.

Each dataframe is millions of rows, so I am looking for an efficient way to perform this operation. I am not sure if this is a situation that requires an outer join or an anti-join.
You can union the two dataframes and keep the first record for each id.
from functools import reduce
from pyspark.sql import DataFrame, Window
from pyspark.sql.functions import monotonically_increasing_id, col, rank

# Put df2 first so its rows come earlier in the union and win
# whenever an id appears in both dataframes
df = reduce(DataFrame.unionByName, [df2, df1])
df = df.withColumn('row_num', monotonically_increasing_id())
window = Window.partitionBy('id').orderBy('row_num')
df = (df.withColumn('rank', rank().over(window))
        .filter(col('rank') == 1)
        .drop('rank', 'row_num'))
Output
+------+--------+----+
| id|category|flag|
+------+--------+----+
|101jkl| type 3| 0|
|101xyz| type 3| 0|
|123abc| type 1| 1|
|456def| type 1| 1|
|789ghi| type 2| 1|
+------+--------+----+
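The "first record per id wins" rule the answer relies on can be modeled in plain Python (a sketch only, using sample rows copied from the question's tables; column order is id, category, flag):

```python
# Sample rows mirroring the question's DF1 and DF2 (id, category, flag).
df1_rows = [("123abc", "type 1", 1), ("456def", "type 1", 1),
            ("789ghi", "type 2", 0), ("101jkl", "type 3", 0)]
df2_rows = [("123abc", "type 1", 1), ("456def", "type 1", 1),
            ("789ghi", "type 2", 1), ("101xyz", "type 3", 0)]

# List df2 first so its rows win whenever an id appears in both --
# the same effect as ordering the union by a monotonically increasing
# id and keeping only rank 1 within each id partition.
merged = {}
for row in df2_rows + df1_rows:
    merged.setdefault(row[0], row)  # first occurrence per id is kept

result = sorted(merged.values())
# 789ghi keeps DF2's flag of 1, and both 101jkl and 101xyz survive.
```

This reproduces the five rows shown in the output above, with 789ghi carrying DF2's corrected flag.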
Option 1: I would find the ids in df1 that are not in df2 and put them into a subset dataframe; I would then union that subset with df2.

Or

Option 2: Find the rows in df1 whose ids are also in df2, drop those rows, and then union df2.

The approach I take would obviously be based on which is less expensive computationally.
Option 1 code
# ids present in df1 but missing from df2, kept as a dataframe so this
# works for any number of missing ids (collecting a single value would not)
missing_ids = df1.select('id').subtract(df2.select('id'))
df2.union(df1.join(missing_ids, 'id')).show()
Outcome
+------+--------+----+
| id|category|flag|
+------+--------+----+
|123abc| type 1| 1|
|456def| type 1| 1|
|789ghi| type 2| 1|
|101xyz| type 3| 0|
|101jkl| type 3| 0|
+------+--------+----+
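Option 2 (drop the df1 rows whose ids also appear in df2, then union df2) is not shown in the answer; its logic can be sketched in plain Python, again assuming the question's sample rows:

```python
# Sample rows mirroring the question's DF1 and DF2 (id, category, flag).
df1_rows = [("123abc", "type 1", 1), ("456def", "type 1", 1),
            ("789ghi", "type 2", 0), ("101jkl", "type 3", 0)]
df2_rows = [("123abc", "type 1", 1), ("456def", "type 1", 1),
            ("789ghi", "type 2", 1), ("101xyz", "type 3", 0)]

# Anti-join on id: keep only df1 rows whose id does not appear in df2.
df2_ids = {row[0] for row in df2_rows}
leftovers = [row for row in df1_rows if row[0] not in df2_ids]

# Union df2 with the leftovers; df2's version of 789ghi wins because
# the conflicting df1 row was dropped before the union.
result = df2_rows + leftovers
```

In PySpark this corresponds to a left anti-join, e.g. `df1.join(df2, 'id', 'left_anti')`, followed by a union with df2, and it yields the same five rows as Option 1.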