
How to compare a column across two different DataFrames in PySpark

I want to compare "pitid" in dataframe1 with "pitid" in dataframe2 and extract the rows of dataframe2 whose pitid is missing from dataframe1.

dataframe1:

+---+-----+----+-----------+
| id|marks|name|      pitid|
+---+-----+----+-----------+
|  1|    1|  FR| 1496875194|
|  2|    1|  US| -744211593|
|  5|    2|  DE|-1433680238|
|  4|    1|  DE| -366408878|
|  3|    3|  DE|  526286357|
+---+-----+----+-----------+

dataframe2:

+---+-----+----+-----------+
| id|marks|name|      pitid|
+---+-----+----+-----------+
|  1|    1|  FR| 1496875194|
|  7|    9|  HY| -816101137|
|  6|    5|  FE| 1044793796|
|  2|    1|  US| -744211593|
|  5|    2|  DE|-1433680238|
|  4|    1|  DE| -366408878|
|  3|    3|  DE|  526286357|
+---+-----+----+-----------+

expected output:

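+---+-----+----+-----------+
| id|marks|name|      pitid|
+---+-----+----+-----------+
|  7|    9|  HY| -816101137|
|  6|    5|  FE| 1044793796|
+---+-----+----+-----------+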

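You can use a left anti join on pitid, which keeps only the rows of df2 whose pitid has no match in df1:

diff = df2.join(df1, on="pitid", how="left_anti")

If the matching rows have the same values in every column of both DataFrames, you can also use subtract (the PySpark counterpart of except in the Scala API):

df2.subtract(df1)

Both return the records that are in dataframe2 but not in dataframe1. Note that subtract compares whole rows and drops duplicates (it behaves like EXCEPT DISTINCT in SQL), while the anti join compares only pitid.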
