
How to compare columns in two dataframes in Scala (Spark)

We are moving data from table1 to table2. I need to create a reconciliation report showing whether the data in table1 exists in table2.

Example:

val df1 = """(select col1, col2, col3, col4 from table1)""" 
val df2 = """(select col21,col22,col23,c24 from table2)"""

Now I need to check if the data in table1 exists in table2 and write to a report if it is missing.

A left anti join is the elegant way to filter out rows that exist in one dataframe but not in another, comparing on one or more columns.
Since you are not comfortable with the left anti join solution, let's go with an alternative approach.
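For reference, the left anti join mentioned above can be sketched as follows, reusing the small name/age dataframes from the example below (the local SparkSession setup is an assumption for a self-contained run):

```scala
import org.apache.spark.sql.SparkSession

// Assumed local session just to make the sketch runnable
val spark = SparkSession.builder().appName("reconciliation").master("local[*]").getOrCreate()
import spark.implicits._

val df1 = Seq(("Ravi", 20), ("Kiran", 25), ("Gaurav", 30)).toDF("name", "age")
val df2 = Seq(("Ravi", 20)).toDF("name", "age")

// "left_anti" keeps only the df1 rows that have no match in df2 on the join columns
val missing = df1.join(df2, Seq("name", "age"), "left_anti")
missing.show()
```

Unlike `except`, the join lets you compare on a subset of columns (e.g. only a key column) rather than requiring whole rows to match.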

Let's assume we have two dataframes with the same column names:

import spark.implicits._ // required for the toDF conversion on Seq

val DF1 = Seq(
  ("Ravi", 20),
  ("Kiran", 25),
  ("Gaurav", 30),
  ("Vinay", 35),
  ("Mahesh", 40)
).toDF("name", "age")
val DF2 = Seq(
  ("Ravi", 20),
  ("Mahesh", 40)
).toDF("name", "age")
DF1.except(DF2).show()

[Output screenshot: the rows present in DF1 but not in DF2, i.e. Kiran/25, Gaurav/30 and Vinay/35]

Also check the elegant solution given by Tzach Zohar using left-anti-join-in-spark.
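Since the original question's tables use different column names (col1..col4 vs. col21..col24), the `except` approach needs the second dataframe's columns renamed first. A minimal sketch, assuming the tables are registered in Spark SQL and with a hypothetical output path for the report:

```scala
// Column names follow the question; table1/table2 are assumed to be queryable via spark.sql
val df1 = spark.sql("select col1, col2, col3, col4 from table1")
val df2 = spark.sql("select col21, col22, col23, col24 from table2")

// Rename df2's columns positionally to match df1 so except() can compare the rows
val df2Aligned = df2.toDF(df1.columns: _*)

// Rows of table1 missing from table2 become the reconciliation report
df1.except(df2Aligned)
  .write.mode("overwrite")
  .csv("/tmp/reconciliation_report") // hypothetical report location
```

The positional `toDF(df1.columns: _*)` rename assumes the two selects list corresponding columns in the same order.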
