How to compare two columns in two different dataframes in PySpark
I want to compare the "pitid" column of dataframe1 with the "pitid" column of dataframe2 and extract the rows that are missing from dataframe1.
dataframe1:
+---+-----+----+-----------+
| id|marks|name|      pitid|
+---+-----+----+-----------+
| 1| 1| FR| 1496875194|
| 2| 1| US| -744211593|
| 5| 2| DE|-1433680238|
| 4| 1| DE| -366408878|
| 3| 3| DE| 526286357|
+---+-----+----+-----------+
dataframe2:
+---+-----+----+-----------+
| id|marks|name|      pitid|
+---+-----+----+-----------+
| 1| 1| FR| 1496875194|
| 7| 9| HY| -816101137|
| 6| 5| FE| 1044793796|
| 2| 1| US| -744211593|
| 5| 2| DE|-1433680238|
| 4| 1| DE| -366408878|
| 3| 3| DE| 526286357|
+---+-----+----+-----------+
expected output:
+---+-----+----+-----------+
|  7|    9|  HY| -816101137|
|  6|    5|  FE| 1044793796|
+---+-----+----+-----------+
You can use a left anti join:
val diff = df2.join(df1, df2("pitid") === df1("pitid"), "left_anti")
If the values of all columns are the same in both dataframes, you can use except (called subtract in PySpark):
df2.subtract(df1)
Both give the records that are in dataframe2 but not in dataframe1.