So, I have an input.csv that looks something like this:
First_Name  Last_Name  Birthdate   Gender  Email_ID        Mobile
Smit        Will       21-04-1974  M       da1@gmail.com   5224521452
Bob         Builder    14-03-1992  M       ad4@gmail.com   2452586253
And a Database.csv with a few more records in it:
First_Name  Last_Name  Birthdate   Gender  Email_ID        Mobile
Bob         Micheles   10-04-1982  M       ya4@gmail.com   7845214525
Will        Smith      21-04-1974  M       da1@gmail.com   9874521452
Emma        Watson     21-08-1989  F       emma@gmail.com  5748214563
Emma        Smit       21-08-1999  F       da1@gmail.com   9874521452
bob         robison    14-03-1992  M       za@gmail.com    2452586253
df_DataBase = spark.read.csv("DataBase.csv",inferSchema=True,header=True)
My expected output is:
NOTE: The person is not the same when the email, phone, and birthdate all fail to match.
It would be great if this could be achieved using PySpark.
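Read literally, the note means two records describe the same person as soon as any one of email, mobile, or birthdate matches. A minimal plain-Python sketch of that reading, checked against the sample rows (`is_same_person` is a hypothetical helper used only for illustration, not part of PySpark):

```python
def is_same_person(a, b):
    """Assumed rule: same person if at least one of the three fields matches."""
    keys = ("Email_ID", "Mobile", "Birthdate")
    return any(a[k] == b[k] for k in keys)


# Rows taken from the sample files above, reduced to the three compared fields.
smit_will = {"Email_ID": "da1@gmail.com", "Mobile": "5224521452", "Birthdate": "21-04-1974"}
will_smith = {"Email_ID": "da1@gmail.com", "Mobile": "9874521452", "Birthdate": "21-04-1974"}
emma_watson = {"Email_ID": "emma@gmail.com", "Mobile": "5748214563", "Birthdate": "21-08-1989"}

print(is_same_person(smit_will, will_smith))   # email and birthdate match
print(is_same_person(smit_will, emma_watson))  # nothing matches
```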
You can try something like the following:
from pyspark.sql.functions import col, lit
# header=True so the first row supplies the column names used below
ip = spark.read.csv("input.csv", header=True, inferSchema=True)
db = spark.read.csv("database.csv", header=True, inferSchema=True)
# condition if person is same: any one of email, mobile or birthdate matches
person_exists = (
    (col('a.Email_ID') == col('b.Email_ID'))
    | (col('a.Mobile') == col('b.Mobile'))
    | (col('a.Birthdate') == col('b.Birthdate'))
)
# people existing in db; distinct() because one input row can match several db rows
existing_persons = (
    ip.alias('a').join(db.alias('b'), person_exists, "inner")
    .select([col('a.' + x) for x in ip.columns])
    .distinct()
)
# people not existing in db
non_existing = ip.subtract(existing_persons)
# add a column to indicate if same person or not
existing_persons = existing_persons.withColumn('Same_Person', lit('Yes'))
non_existing = non_existing.withColumn('Same_Person', lit('No'))