Compare columns from two different dataframes based on id
I have two dataframes to compare. The records are in a different order, and the column names might differ as well. I have to compare several columns based on the unique key (id).
Example: consider dataframes df1 and df2.
df1:
+---+-------+-----+
| id|student|marks|
+---+-------+-----+
|  1|  Vijay|   23|
|  4| Vithal|   24|
|  2|    Ram|   21|
|  3|  Rahul|   25|
+---+-------+-----+
df2:
+-----+--------+------+
|newId|student1|marks1|
+-----+--------+------+
|    3|   Rahul|    25|
|    2|     Ram|    23|
|    1|   Vijay|    23|
|    4|  Vithal|    24|
+-----+--------+------+
Here, matching rows on id and newId, I need to compare the student name and marks values, and check whether the student with the same id has the same name and marks in both dataframes.
In this example, the student with id 2 has 21 marks in df1 but 23 marks in df2.
df1.exceptAll(df2).show()
// +---+-------+-----+
// | id|student|marks|
// +---+-------+-----+
// |  2|    Ram|   21|
// +---+-------+-----+
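Since exceptAll compares columns by position and only returns the df1 side of a mismatch, it may also help to join on the key and compare each column pair explicitly, so both values show up side by side. The following is a minimal sketch of that idea for spark-shell, assuming the sample data from the question; the `pairs` list and the variable names are illustrative and not part of the original answer.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// In spark-shell a session already exists; getOrCreate simply reuses it.
val spark = SparkSession.builder().master("local[*]").appName("compare-by-id").getOrCreate()
import spark.implicits._

// Sample data copied from the question.
val df1 = Seq((1, "Vijay", 23), (4, "Vithal", 24), (2, "Ram", 21), (3, "Rahul", 25)).toDF("id", "student", "marks")
val df2 = Seq((3, "Rahul", 25), (2, "Ram", 23), (1, "Vijay", 23), (4, "Vithal", 24)).toDF("newId", "student1", "marks1")

// Hypothetical mapping of (df1 column -> df2 column) to compare; adjust to the real schema.
val pairs = Seq("student" -> "student1", "marks" -> "marks1")

// A row is a mismatch if any pair differs; <=> is null-safe equality.
val mismatch = pairs.map { case (a, b) => !(col(a) <=> col(b)) }.reduce(_ || _)

// Inner join on the key, then keep only the rows where some pair differs.
val diffs = df1.join(df2, col("id") === col("newId"), "inner").filter(mismatch)
diffs.show()
```

For the sample data, only the row for id 2 should be flagged. An inner join ignores ids that exist in only one dataframe; a "full_outer" join would be one way to surface those as well.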
I think diff will give the result you are looking for.
scala> df1.diff(df2)
res0: Seq[org.apache.spark.sql.Row] = List([2,Ram,21])
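As far as I know, diff is a method on Scala collections rather than on a Spark DataFrame, and the Seq[Row] result type above suggests the rows had already been collected. A minimal sketch under that assumption, for spark-shell, using the sample data from the question (the variable names are illustrative):

```scala
import org.apache.spark.sql.{Row, SparkSession}

// In spark-shell a session already exists; getOrCreate simply reuses it.
val spark = SparkSession.builder().master("local[*]").appName("diff-sketch").getOrCreate()
import spark.implicits._

// Sample data copied from the question.
val df1 = Seq((1, "Vijay", 23), (4, "Vithal", 24), (2, "Ram", 21), (3, "Rahul", 25)).toDF("id", "student", "marks")
val df2 = Seq((3, "Rahul", 25), (2, "Ram", 23), (1, "Vijay", 23), (4, "Vithal", 24)).toDF("newId", "student1", "marks1")

// collect() pulls every row to the driver, so this only suits small dataframes.
val rows1: Seq[Row] = df1.collect().toSeq
val rows2: Seq[Row] = df2.collect().toSeq

// Rows are compared by value and position, so differing column names do not matter.
val onlyInDf1: Seq[Row] = rows1.diff(rows2)
onlyInDf1.foreach(println)
```

For anything larger than a small test dataset, a distributed operation such as exceptAll avoids moving all the data to the driver.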