
How to compare two CSV files using PySpark and validate whether records exist or not

So, I have an input.csv that looks something like this:

First_Name  Last_Name   Birthdate   Gender  Email_ID        Mobile
Smit        Will        21-04-1974  M       da1@gmail.com   5224521452
Bob         Builder     14-03-1992  M       ad4@gmail.com   2452586253

And a Database.csv with a few more records in it:

First_Name  Last_Name   Birthdate   Gender  Email_ID        Mobile
Bob         Micheles    10-04-1982  M       ya4@gmail.com   7845214525
Will        Smith       21-04-1974  M       da1@gmail.com   9874521452
Emma        Watson      21-08-1989  F       emma@gmail.com  5748214563
Emma        Smit        21-08-1999  F       da1@gmail.com   9874521452
bob         robison     14-03-1992  M       za@gmail.com    2452586253

df_DataBase = spark.read.csv("DataBase.csv", inferSchema=True, header=True)

My expected output is:

  1. Bob Builder is the same person as bob robison, since only his Last_Name and Email_ID are different.
  2. Smit Will and Will Smith are the same person, since only the names and the mobile number are different.

Finally, print whether or not each person already exists, like this:

Expected output

NOTE: The person is not the same when the email, phone and birthdate all fail to match.
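
The rule in this note can be checked on the sample rows with a few lines of plain Python before moving to Spark. This is only a sketch under one reading of the note: two records count as the same person when at least one of Email_ID, Mobile or Birthdate agrees. The helper name same_person is hypothetical.

```python
# Sample rows copied from the two tables above.
FIELDS = ["First_Name", "Last_Name", "Birthdate", "Gender", "Email_ID", "Mobile"]

input_rows = [
    ("Smit", "Will", "21-04-1974", "M", "da1@gmail.com", "5224521452"),
    ("Bob", "Builder", "14-03-1992", "M", "ad4@gmail.com", "2452586253"),
]
database_rows = [
    ("Bob", "Micheles", "10-04-1982", "M", "ya4@gmail.com", "7845214525"),
    ("Will", "Smith", "21-04-1974", "M", "da1@gmail.com", "9874521452"),
    ("Emma", "Watson", "21-08-1989", "F", "emma@gmail.com", "5748214563"),
    ("Emma", "Smit", "21-08-1999", "F", "da1@gmail.com", "9874521452"),
    ("bob", "robison", "14-03-1992", "M", "za@gmail.com", "2452586253"),
]

# Assumption: these three fields decide identity, as stated in the note.
MATCH_KEYS = ("Email_ID", "Mobile", "Birthdate")


def same_person(a, b):
    """Hypothetical helper: True if any of the key fields agree."""
    return any(a[k] == b[k] for k in MATCH_KEYS)


people_in = [dict(zip(FIELDS, row)) for row in input_rows]
people_db = [dict(zip(FIELDS, row)) for row in database_rows]

# Map each input person to whether some database record matches.
exists = {
    (p["First_Name"], p["Last_Name"]): any(same_person(p, d) for d in people_db)
    for p in people_in
}
print(exists)
```

Under this reading, both input rows are found in the database: Smit Will through the shared email, Bob Builder through the shared birthdate and mobile number.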

It would be great if this could be achieved with PySpark.

You can try something like the following:

from pyspark.sql.functions import col, lit

# header=True so the first line of each file is used for the column names
ip = spark.read.csv("input.csv", header=True)
db = spark.read.csv("database.csv", header=True)

# condition: same person if email, mobile or birthdate matches
person_exists = (
    (col('a.Email_ID') == col('b.Email_ID'))
    | (col('a.Mobile') == col('b.Mobile'))
    | (col('a.Birthdate') == col('b.Birthdate'))
)

# people existing in db (distinct, because one input row can match several db rows)
existing_persons = (
    ip.alias('a')
    .join(db.alias('b'), person_exists, "inner")
    .select([col('a.' + x) for x in ip.columns])
    .distinct()
)

# people not existing in db
non_existing = ip.subtract(existing_persons)

# add a column to indicate if same person or not
existing_persons = existing_persons.withColumn('Same_Person', lit('Yes'))
non_existing = non_existing.withColumn('Same_Person', lit('No'))
