How to compare two huge Spark DataFrames at row level and print the differences

Question

I have two very large Spark DataFrames. I want to compare them at the row level and print only the differences.

For example:

df1= firstname:abc lastname:xyz company:123

df2= firstname:abc lastname:xyz company:456

Expected output: diff= company(df1):123 company(df2):456

Answer 1

As far as I know, there is no optimal solution for the problem you have described, because a difference between two DataFrames can only be pinned down when there is a column (or other reference) on which both DataFrames can be joined.

That said, one approach is to use the subtract function, which finds the differences to some extent: it returns the rows of one DataFrame that do not appear in the other.

>>> df_1.show()
+-----+-----+-----+
|fname|lname|cmpny|
+-----+-----+-----+
|  abc|  xyz|  123|
+-----+-----+-----+

>>> df_2.show()
+-----+-----+-----+
|fname|lname|cmpny|
+-----+-----+-----+
|  abc|  xyz|  456|
+-----+-----+-----+

>>> df_1.select('*').subtract(df_2.select('*')).show()
+-----+-----+-----+
|fname|lname|cmpny|
+-----+-----+-----+
|  abc|  xyz|  123|
+-----+-----+-----+

>>> df_2.select('*').subtract(df_1.select('*')).show()
+-----+-----+-----+
|fname|lname|cmpny|
+-----+-----+-----+
|  abc|  xyz|  456|
+-----+-----+-----+

Answer 2

I think you are looking for except:

df1.except(df2)

This returns the rows in df1 that are not in df2.
