
PySpark: Moving rows from one dataframe into another if column values are not found in second dataframe

I have two Spark dataframes with similar schemas:

DF1:

id       category  flag
123abc   type 1     1 
456def   type 1     1
789ghi   type 2     0
101jkl   type 3     0

DF2:

id       category  flag
123abc   type 1     1 
456def   type 1     1
789ghi   type 2     1
101xyz   type 3     0

DF1 has more data than DF2, so I cannot replace it. However, DF2 will have ids not found in DF1, as well as several ids with more accurate flag data. This means there are two situations that I need resolved:

  1. 789ghi has a different flag and needs to overwrite the 789ghi in DF1.
  2. 101xyz is not found in DF1 and needs to be moved over.

Each dataframe is millions of rows, so I am looking for an efficient way to perform this operation. I am not sure if this is a situation that requires an outer join or an anti-join.
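
For reference, a minimal sketch to reproduce the two example dataframes above; the SparkSession setup is assumed, and the rows are taken directly from the tables shown.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumed session setup

# DF1: the larger dataframe that should be updated
df1 = spark.createDataFrame(
    [("123abc", "type 1", 1), ("456def", "type 1", 1),
     ("789ghi", "type 2", 0), ("101jkl", "type 3", 0)],
    ["id", "category", "flag"])

# DF2: the smaller dataframe with newer flags and extra ids
df2 = spark.createDataFrame(
    [("123abc", "type 1", 1), ("456def", "type 1", 1),
     ("789ghi", "type 2", 1), ("101xyz", "type 3", 0)],
    ["id", "category", "flag"])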

You can union the two dataframes and keep the first record for each id.

from functools import reduce
from pyspark.sql import DataFrame, Window
from pyspark.sql.functions import monotonically_increasing_id, col, rank

# Union with df2 first so its rows come before df1's rows for the same id
df = reduce(DataFrame.unionByName, [df2, df1])

# Tag every row with an increasing id that follows the union order
df = df.withColumn('row_num', monotonically_increasing_id())

window = Window.partitionBy("id").orderBy('row_num')

# Keep only the first row per id (df2's version wherever the id exists in both)
df = (df.withColumn('rank', rank().over(window))
        .filter(col('rank') == 1)).drop('rank', 'row_num')

Output

+------+--------+----+
|    id|category|flag|
+------+--------+----+
|101jkl|  type 3|   0|
|101xyz|  type 3|   0|
|123abc|  type 1|   1|
|456def|  type 1|   1|
|789ghi|  type 2|   1|
+------+--------+----+

Option 1: I would find the ids that are in df1 but not in df2 and put them into a subset dataframe. I would then union that subset with df2.

Or

Option 2: Find the elements in df1 that are also in df2, drop those rows, and then union df2 (a sketch of this is shown after the Option 1 output below). The approach I take would obviously be based on whichever is less computationally expensive.

Option 1 code

from pyspark.sql.functions import col

# The id present in df1 but missing from df2 (a single id, 101jkl, in this example)
s = df1.select('id').subtract(df2.select('id')).collect()[0][0]

df2.union(df1.filter(col('id') == s)).show()

Outcome

+------+--------+----+
|    id|category|flag|
+------+--------+----+
|123abc|  type 1|   1|
|456def|  type 1|   1|
|789ghi|  type 2|   1|
|101xyz|  type 3|   0|
|101jkl|  type 3|   0|
+------+--------+----+
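
Option 2 code (a minimal sketch; using a left anti join to drop the df1 rows whose id already exists in df2 is my assumption, since no code was given for this option)

# Keep only the df1 rows whose id does not appear in df2, then append all of df2
df1_only = df1.join(df2.select('id'), on='id', how='left_anti')

df1_only.unionByName(df2).show()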
