Find difference of values on two dataframes for each corresponding column using pyspark

I want to find the differences in the column values of two dataframes when they are joined with an inner join.

df1 has 10 columns, i.e. key1, key2 & col1, col2, and so on (there can be more columns, and the names can differ). Similarly, df2 has 10 columns, i.e. key1, key2 & col1, col2, and so on.

df3 = df1.join(df2, (df1.key1 == df2.key1) & (df1.key2 == df2.key2), 'inner')

Now I want to compare the corresponding columns of the two dataframes df1 and df2 that are already present in the joined df3.

Currently I am looping over each x, y in zip(df1.columns, df2.columns) and storing the results in a list: unmatchList.append(df3.select(df1[x], df2[y]).filter(df1[x] != df2[y]))

Can I avoid this loop? It is using a lot of memory here. There are other calculations that I am doing, but this is the small code snippet I have presented. The idea behind this is to find the differing values in the corresponding columns for each matching row of the two dataframes. exceptAll does not work for this requirement, since it finds the difference based on the position of the columns; I need to find the difference only when the keys of both dataframes match.

df1

key1 key2 col1 col2 col3 col4 col5
k11  k21   1    1    1    1    1
k12  k22   2    2    2    2    2

df2

key1 key2 col1 col2 col3 col4 col5
k11  k21   1    1    2    1    1
k12  k22   2    3    2    3    4
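
For anyone who wants to reproduce this, a minimal sketch that builds the two sample dataframes above (assuming an existing or default SparkSession):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

cols = ['key1', 'key2', 'col1', 'col2', 'col3', 'col4', 'col5']
df1 = spark.createDataFrame([('k11', 'k21', 1, 1, 1, 1, 1),
                             ('k12', 'k22', 2, 2, 2, 2, 2)], cols)
df2 = spark.createDataFrame([('k11', 'k21', 1, 1, 2, 1, 1),
                             ('k12', 'k22', 2, 3, 2, 3, 4)], cols)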

The final output I want is:

key1 key2 col  val1 val2
k11  k21  col3 1    2
k12  k22  col2 2    3
k12  k22  col4 2    3
k12  k22  col5 2    4

val1 is obtained from df1 and val2 is obtained from df2.
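
For reference, a minimal runnable version of the loop described above (assuming df3 is the inner join from earlier; PySpark columns use != rather than <>):

unmatchList = []
for x, y in zip(df1.columns, df2.columns):
    # collect, per column pair, the rows where the two sides disagree
    unmatchList.append(df3.select(df1[x], df2[y]).filter(df1[x] != df2[y]))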

The problem here is that if the number of columns in a DataFrame is high, the performance of the loop degrades, and it can further lead to out-of-memory errors.

Instead of storing the results in a list, we can use a dataframe and store (append or insert) the results of each iteration into some HDFS location or Hive table.

from pyspark.sql.functions import col, lit

for x, y in zip(df1.columns[2:], df2.columns[2:]):  # skip the two key columns
    # keep the key columns, record which column differs and the value from each side
    # (assumes the value columns of df1 and df2 carry distinct names in joinedDF)
    outputDF = (joinedDF
                .filter(col(x) != col(y))
                .select(col('key1'), col('key2'),
                        lit(x).alias('col'),
                        col(x).alias('val1'),
                        col(y).alias('val2')))

    # append each iteration's result to a Hive table instead of holding it in memory
    (outputDF.coalesce(1).write.partitionBy('col')
     .mode('append').format('hive').saveAsTable('DB.Table'))

# Another approach, if the number of columns is small (10-15): union the per-column results
outputDF = None
for x, y in zip(df1.columns[2:], df2.columns[2:]):
    diffDF = (joinedDF
              .filter(col(x) != col(y))
              .select(col('key1'), col('key2'),
                      lit(x).alias('col'),
                      col(x).alias('val1'),
                      col(y).alias('val2')))
    outputDF = diffDF if outputDF is None else outputDF.union(diffDF)

(outputDF.coalesce(1).write.partitionBy('col')
 .mode('append').format('hive').saveAsTable('DB.Table'))
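
If the goal is to avoid the Python loop entirely, a loop-free sketch using Spark SQL's stack() function to unpivot the joined dataframe is shown below. This assumes both dataframes share the key names key1/key2 and that the value columns have a common (or coercible) type; the l_/r_ prefixes are hypothetical names used only to keep the two sides unambiguous after the join:

from pyspark.sql import functions as F

keys = ['key1', 'key2']
value_cols = [c for c in df1.columns if c not in keys]

# rename the value columns so both sides stay unambiguous after the join
left = df1.select(keys + [F.col(c).alias('l_' + c) for c in value_cols])
right = df2.select(keys + [F.col(c).alias('r_' + c) for c in value_cols])
joined = left.join(right, keys, 'inner')

# stack(n, 'col1', l_col1, r_col1, ...) emits one row per value column
stack_expr = "stack({n}, {args}) as (col, val1, val2)".format(
    n=len(value_cols),
    args=", ".join("'{c}', l_{c}, r_{c}".format(c=c) for c in value_cols))

result = (joined
          .select(*keys, F.expr(stack_expr))
          .filter(F.col('val1') != F.col('val2')))

On the sample data above, result contains exactly the four rows of the expected output, with no Python-side loop or list involved.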

