Compare two Spark dataframes column-wise and fetch the mismatch records

I want to iterate over and compare the columns between two Spark dataframes and store the mismatch records.

I am getting the mismatch records in dataframe format, so I want to store them in some variable, since a dataframe is immutable. Please suggest how to store the dataframe output as rows and columns in a variable or collection.

import scala.collection.mutable.ArrayBuffer
import spark.implicits._

var mismatchValues = new ArrayBuffer[String]()
val columns1 = srcTable_colMismatch.schema.fields.map(_.name)

// For each column, keep the (column, hash_key, row_num) tuples present in the source but not in the target.
val selectiveDifference = columns1.map(c =>
  srcTable_colMismatch.select(c, "hash_key", "row_num")
    .exceptAll(tgtTable_colMismatch.select(c, "hash_key", "row_num")).as(c))

// Join each per-column difference back to the source rows and collect them as strings.
selectiveDifference.foreach { e =>
  if (e.count() > 0)
    mismatchValues += sortedMismatchRecords.as("SRC")
      .join(e.as("dif"), $"SRC.hash_key" === $"dif.hash_key")
      .select("SRC.*").collect().mkString(",")
}

val convertedDF = mismatchValues.map(_.toString).toDF()
convertedDF.show()
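
If the end goal is simply to hold the mismatch output as rows and columns in a local variable, one option is to collect() it to the driver (only safe when the result is small) or to keep it as a typed Dataset. A minimal, self-contained sketch with made-up data, reusing the hash_key and row_num column names from above:

import org.apache.spark.sql.{Row, SparkSession}

// Hypothetical row shape for the mismatch output.
case class Mismatch(hash_key: String, row_num: Long)

object CollectMismatches {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("collect-mismatches").getOrCreate()
    import spark.implicits._

    // Stand-in for the real mismatch dataframe produced by the logic above.
    val mismatchDF = Seq(("k1", 1L), ("k2", 2L)).toDF("hash_key", "row_num")

    // Option 1: pull the rows to the driver as an Array[Row]
    // (only when the mismatch set is known to be small).
    val mismatchRows: Array[Row] = mismatchDF.collect()
    mismatchRows.foreach(println)

    // Option 2: keep the data distributed as a typed Dataset
    // and materialise it only when needed.
    val mismatchDS = mismatchDF.as[Mismatch]
    mismatchDS.show()

    spark.stop()
  }
}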

Spark can do this for you:

df1.union(df2).subtract(df1.intersect(df2))

I strongly discourage you from using a variable to compare unless you have it on good authority that it will fit in memory.

df1.union(df2)             // create one data set
  .subtract(               // remove items that match this data frame
    df1.intersect(df2)     // all items that are in both dataframes
  )
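
For illustration, a minimal end-to-end sketch of that approach (the id/value columns and the sample data are invented for the example):

import org.apache.spark.sql.SparkSession

object SymmetricDiffExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("symmetric-diff").getOrCreate()
    import spark.implicits._

    // Two small frames with the same schema.
    val df1 = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "value")
    val df2 = Seq((1, "a"), (2, "x"), (4, "d")).toDF("id", "value")

    // Rows that appear in only one of the two dataframes,
    // i.e. the mismatch records from either side.
    val mismatches = df1.union(df2).subtract(df1.intersect(df2))

    mismatches.show()   // (2,b), (2,x), (3,c), (4,d) in some order

    spark.stop()
  }
}

Note that subtract behaves like SQL EXCEPT DISTINCT, so duplicate rows are dropped from the result; if duplicates matter, exceptAll is the row-preserving variant.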
