Compare two Spark dataframes column-wise and fetch the mismatch records

I want to iterate over and compare the columns between two Spark dataframes and store the mismatch records.

I am getting the mismatch records in dataframe format, so I want to store them in some variable, since a dataframe is immutable. Please suggest how to store the dataframe output as rows and columns in a variable or collection.

import scala.collection.mutable.ArrayBuffer
import spark.implicits._

var mismatchValues = new ArrayBuffer[String]()
val columns1 = srcTable_colMismatch.schema.fields.map(_.name)

// For each column, keep the (column, hash_key, row_num) tuples present in the source but not in the target.
val selectiveDifference = columns1.map(c =>
  srcTable_colMismatch.select(c, "hash_key", "row_num")
    .exceptAll(tgtTable_colMismatch.select(c, "hash_key", "row_num")).as(c))

// Join each per-column difference back to the source rows and collect them as strings.
selectiveDifference.foreach { e =>
  if (e.count() > 0)
    mismatchValues += sortedMismatchRecords.as("SRC")
      .join(e.as("dif"), $"SRC.hash_key" === $"dif.hash_key")
      .select("SRC.*").collect().mkString(",")
}

val convertedDF = mismatchValues.map(_.toString).toDF()
convertedDF.show()
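
If the end goal is simply to hold the mismatch output as rows and columns in a local variable, one option is to collect() it to the driver (only safe when the result is small) or to keep it as a typed Dataset. A minimal, self-contained sketch with made-up data, reusing the hash_key and row_num column names from above:

import org.apache.spark.sql.{Row, SparkSession}

// Hypothetical row shape for the mismatch output.
case class Mismatch(hash_key: String, row_num: Long)

object CollectMismatches {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("collect-mismatches").getOrCreate()
    import spark.implicits._

    // Stand-in for the real mismatch dataframe produced by the logic above.
    val mismatchDF = Seq(("k1", 1L), ("k2", 2L)).toDF("hash_key", "row_num")

    // Option 1: pull the rows to the driver as an Array[Row]
    // (only when the mismatch set is known to be small).
    val mismatchRows: Array[Row] = mismatchDF.collect()
    mismatchRows.foreach(println)

    // Option 2: keep the data distributed as a typed Dataset
    // and materialise it only when needed.
    val mismatchDS = mismatchDF.as[Mismatch]
    mismatchDS.show()

    spark.stop()
  }
}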

Spark can do this for you:

df1.union(df2).subtract(df1.intersect(df2))

I strongly discourage you from using a variable to compare unless you have it on good authority that it will fit in memory.

df1.union(df2)             // create one data set
  .subtract(               // remove items that match this data frame
    df1.intersect(df2)     // all items that are in both dataframes
  )
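
For illustration, a minimal end-to-end sketch of that approach (the id/value columns and the sample data are invented for the example):

import org.apache.spark.sql.SparkSession

object SymmetricDiffExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("symmetric-diff").getOrCreate()
    import spark.implicits._

    // Two small frames with the same schema.
    val df1 = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "value")
    val df2 = Seq((1, "a"), (2, "x"), (4, "d")).toDF("id", "value")

    // Rows that appear in only one of the two dataframes,
    // i.e. the mismatch records from either side.
    val mismatches = df1.union(df2).subtract(df1.intersect(df2))

    mismatches.show()   // (2,b), (2,x), (3,c), (4,d) in some order

    spark.stop()
  }
}

Note that subtract behaves like SQL EXCEPT DISTINCT, so duplicate rows are dropped from the result; if duplicates matter, exceptAll is the row-preserving variant.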
