Find the difference of values in two dataframes for each corresponding column using pyspark
I want to find the differences in the column values of two dataframes when they are joined with an inner join.
df1 has 10 columns, i.e. key1, key2 & col1, col2 and so on (there can be more columns and the names can differ). Similarly, df2 has 10 columns, i.e. key1, key2 & col1, col2 and so on.
df3 = df1.join(df2, (df1.key1 == df2.key1) & (df1.key2 == df2.key2), 'inner')
Now I want to compare the corresponding columns of df1 and df2 that are already present in the joined df3.
Currently I am looping over each x, y in zip(df1.columns, df2.columns) and storing the mismatches in a list:
unmatchList.append(df3.select(df1[x], df2[y]).filter(df1[x] != df2[y]))
Can I avoid this loop, as it uses a lot of memory here? There are other calculations I am doing, but this is the small code snippet I have presented. The idea behind this is to find the differing values in the corresponding columns for matching rows of the two dataframes. exceptAll does not work for this requirement, as it finds the difference based on the position of the columns. I need to find the difference only when the keys of both dataframes match.
df1:
key1 key2 col1 col2 col3 col4 col5
k11  k21  1    1    1    1    1
k12  k22  2    2    2    2    2
df2:
key1 key2 col1 col2 col3 col4 col5
k11  k21  1    1    2    1    1
k12  k22  2    3    2    3    4
The final output I want is:
key1 key2 col val1 val2
k11 k21 col3 1 2
k12 k22 col2 2 3
k12 k22 col4 2 3
k12 k22 col5 2 4
val1 is obtained from df1 and val2 is obtained from df2.
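To make the requirement concrete, here is a plain-Python illustration (a sketch with hypothetical in-memory data, not PySpark) of the comparison being asked for: for rows matched on the keys, emit one record per column whose values differ.

```python
# Sample data from the question, keyed by (key1, key2).
cols = ["col1", "col2", "col3", "col4", "col5"]
df1_rows = {("k11", "k21"): [1, 1, 1, 1, 1],
            ("k12", "k22"): [2, 2, 2, 2, 2]}
df2_rows = {("k11", "k21"): [1, 1, 2, 1, 1],
            ("k12", "k22"): [2, 3, 2, 3, 4]}

diffs = []
for keys, vals1 in df1_rows.items():
    vals2 = df2_rows.get(keys)
    if vals2 is None:            # inner join: skip keys absent from df2
        continue
    for col_name, v1, v2 in zip(cols, vals1, vals2):
        if v1 != v2:             # keep only mismatching column values
            diffs.append((*keys, col_name, v1, v2))

# diffs now holds the rows of the desired output table.
```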
The problem here is that if the number of columns in a DataFrame is high, the performance of the loop degrades. It can further lead to memory problems when collecting the output.
Instead of storing the results in a list, we can use a dataframe and store (append or insert) the results of each iteration into some HDFS location or Hive table.
from pyspark.sql.functions import col, lit

# Assumes the corresponding columns carry distinct names in the joined df.
for x, y in zip(df1.columns, df2.columns):
    outputDF = (joinedDF.filter(col(x) != col(y))
                .withColumn('col', lit(x))
                .withColumn('val1', col(x))
                .withColumn('val2', col(y))
                .select('key1', 'key2', 'col', 'val1', 'val2'))
    outputDF.coalesce(1).write.mode('append').partitionBy('col').format('hive').saveAsTable('DB.Table')
Another approach, if the number of columns is small (10-15), is to union the per-column results and write once:
# outputDF must start as an empty dataframe with the same five-column schema.
for x, y in zip(df1.columns, df2.columns):
    outputDF = outputDF.union(joinedDF.filter(col(x) != col(y))
                              .withColumn('col', lit(x))
                              .withColumn('val1', col(x))
                              .withColumn('val2', col(y))
                              .select('key1', 'key2', 'col', 'val1', 'val2'))

outputDF.coalesce(1).write.mode('append').partitionBy('col').format('hive').saveAsTable('DB.Table')
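A loop-free alternative (a sketch, not from the original post) is to unpivot the value columns of each dataframe with Spark SQL's `stack` generator, join the two long-format dataframes on the keys plus the column name, and keep only the rows where the two sides disagree. This assumes both dataframes share the same key and value column names; `build_stack_expr` and `diff_dataframes` are hypothetical helper names.

```python
def build_stack_expr(value_cols, alias):
    """Build a Spark SQL `stack` expression that unpivots `value_cols`
    into (col, <alias>) pairs, e.g. stack(2, 'col1', col1, 'col2', col2)."""
    pairs = ", ".join(f"'{c}', {c}" for c in value_cols)
    return f"stack({len(value_cols)}, {pairs}) as (col, {alias})"

def diff_dataframes(df1, df2, keys, value_cols):
    """Return one row per (keys, col) where df1 and df2 disagree.
    Requires a live SparkSession; shown here as the intended usage."""
    from pyspark.sql import functions as F
    left = df1.selectExpr(*keys, build_stack_expr(value_cols, "val1"))
    right = df2.selectExpr(*keys, build_stack_expr(value_cols, "val2"))
    return (left.join(right, keys + ["col"], "inner")
                .filter(F.col("val1") != F.col("val2")))
```

Because each dataframe is unpivoted in a single `selectExpr`, Spark plans one join and one filter instead of one job per column, which avoids the per-iteration driver overhead of the loop.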