I have two very large Spark DataFrames. I want to compare them at row level and print only the differences.
e.g.:
df1= firstname:abc lastname:xyz company:123
df2= firstname:abc lastname:xyz company:456
expected output: diff= company(df1):123 company(df2):456
As far as I know, there is no optimal solution for the problem you have described, because a row-level difference between two DataFrames can only be found when there is a column (or other reference) on which both DataFrames can be joined.
With that said, one approach is to use the subtract function, which helps to some extent: it returns the distinct rows of one DataFrame that do not appear in the other.
>>> df_1.show()
+-----+-----+-----+
|fname|lname|cmpny|
+-----+-----+-----+
|  abc|  xyz|  123|
+-----+-----+-----+
>>> df_2.show()
+-----+-----+-----+
|fname|lname|cmpny|
+-----+-----+-----+
|  abc|  xyz|  456|
+-----+-----+-----+
>>> df_1.select('*').subtract(df_2.select('*')).show()
+-----+-----+-----+
|fname|lname|cmpny|
+-----+-----+-----+
|  abc|  xyz|  123|
+-----+-----+-----+
>>> df_2.select('*').subtract(df_1.select('*')).show()
+-----+-----+-----+
|fname|lname|cmpny|
+-----+-----+-----+
|  abc|  xyz|  456|
+-----+-----+-----+