PySpark: Moving rows from one dataframe into another if column values are not found in second dataframe
I have two Spark dataframes with similar schemas:

DF1:
id      category  flag
123abc  type 1    1
456def  type 1    1
789ghi  type 2    0
101jkl  type 3    0
DF2:
id      category  flag
123abc  type 1    1
456def  type 1    1
789ghi  type 2    1
101xyz  type 3    0
DF1 has more data than DF2, so I cannot replace it. However, DF2 will have ids not found in DF1, as well as several ids with more accurate flag data. This means there are two situations that I need resolved:
789ghi has a different flag and needs to overwrite the 789ghi in DF1.
101xyz is not found in DF1 and needs to be moved over.

Each dataframe is millions of rows, so I am looking for an efficient way to perform this operation. I am not sure if this is a situation that requires an outer join or an anti-join.
You can union the two dataframes and keep the first record for each id.
from functools import reduce
from pyspark.sql import DataFrame, Window
from pyspark.sql.functions import monotonically_increasing_id, col, rank

# Put df2 first so its rows come earlier in the union and win
# whenever an id appears in both dataframes
df = reduce(DataFrame.unionByName, [df2, df1])
df = df.withColumn('row_num', monotonically_increasing_id())
window = Window.partitionBy('id').orderBy('row_num')
df = (df.withColumn('rank', rank().over(window))
        .filter(col('rank') == 1)
        .drop('rank', 'row_num'))
Output
+------+--------+----+
| id|category|flag|
+------+--------+----+
|101jkl| type 3| 0|
|101xyz| type 3| 0|
|123abc| type 1| 1|
|456def| type 1| 1|
|789ghi| type 2| 1|
+------+--------+----+
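The "first record per id wins" rule the answer relies on can be modeled in plain Python (a sketch only, using sample rows copied from the question's tables; column order is id, category, flag):

```python
# Sample rows mirroring the question's DF1 and DF2 (id, category, flag).
df1_rows = [("123abc", "type 1", 1), ("456def", "type 1", 1),
            ("789ghi", "type 2", 0), ("101jkl", "type 3", 0)]
df2_rows = [("123abc", "type 1", 1), ("456def", "type 1", 1),
            ("789ghi", "type 2", 1), ("101xyz", "type 3", 0)]

# List df2 first so its rows win whenever an id appears in both --
# the same effect as ordering the union by a monotonically increasing
# id and keeping only rank 1 within each id partition.
merged = {}
for row in df2_rows + df1_rows:
    merged.setdefault(row[0], row)  # first occurrence per id is kept

result = sorted(merged.values())
# 789ghi keeps DF2's flag of 1, and both 101jkl and 101xyz survive.
```

This reproduces the five rows shown in the output above, with 789ghi carrying DF2's corrected flag.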
Option 1: I would find the ids in df1 that are not in df2 and put them into a subset dataframe; I would then union that subset with df2.

Or

Option 2: Find the rows in df1 whose ids are also in df2, drop those rows, and then union df2.

The approach I take would obviously be based on which is less expensive computationally.
Option 1 code
# ids present in df1 but missing from df2, kept as a dataframe so this
# works for any number of missing ids (collecting a single value would not)
missing_ids = df1.select('id').subtract(df2.select('id'))
df2.union(df1.join(missing_ids, 'id')).show()
Outcome
+------+--------+----+
| id|category|flag|
+------+--------+----+
|123abc| type 1| 1|
|456def| type 1| 1|
|789ghi| type 2| 1|
|101xyz| type 3| 0|
|101jkl| type 3| 0|
+------+--------+----+
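Option 2 (drop the df1 rows whose ids also appear in df2, then union df2) is not shown in the answer; its logic can be sketched in plain Python, again assuming the question's sample rows:

```python
# Sample rows mirroring the question's DF1 and DF2 (id, category, flag).
df1_rows = [("123abc", "type 1", 1), ("456def", "type 1", 1),
            ("789ghi", "type 2", 0), ("101jkl", "type 3", 0)]
df2_rows = [("123abc", "type 1", 1), ("456def", "type 1", 1),
            ("789ghi", "type 2", 1), ("101xyz", "type 3", 0)]

# Anti-join on id: keep only df1 rows whose id does not appear in df2.
df2_ids = {row[0] for row in df2_rows}
leftovers = [row for row in df1_rows if row[0] not in df2_ids]

# Union df2 with the leftovers; df2's version of 789ghi wins because
# the conflicting df1 row was dropped before the union.
result = df2_rows + leftovers
```

In PySpark this corresponds to a left anti-join, e.g. `df1.join(df2, 'id', 'left_anti')`, followed by a union with df2, and it yields the same five rows as Option 1.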