Let's say that I have these two dataframes:
df1:
| Name    | Surname     | email                    |
| John    | Smith       | JohnSmith@gmail.com      |
| Jake    | Smith       | JakeSmith@gmail.com      |
| Anna    | Hendrix     | Anna1994@protonmail.com  |
| Kale    | Kinderstone | Kinder@hotmail.com       |
| George  | Hiddleston  | GH@tonmail.com           |
| Patrick | Huston      | Huston1990@yahoomail.com |
df2:
| Name     | Surname     | email                  |
| John     | Smith       | JSmith@ymail.com       |
| Hannah   | Montana     | HMontana@ymail.com     |
| Anna     | Hendrix     | AHendrix@ymail.com     |
| Kale     | Kinderstone | KKinderstone@ymail.com |
| Ivan     | Gaganovitch | IG@ymail.com           |
| Florence | Jekins      | FJekins@ymail.com      |
What I want to do is replace the emails of the people who also appear in df2, without touching the rest of the data. So the final result I want is:
df3:
| Name    | Surname     | email                    |
| John    | Smith       | JSmith@ymail.com         |
| Jake    | Smith       | JakeSmith@gmail.com      |
| Anna    | Hendrix     | AHendrix@ymail.com       |
| Kale    | Kinderstone | KKinderstone@ymail.com   |
| George  | Hiddleston  | GH@tonmail.com           |
| Patrick | Huston      | Huston1990@yahoomail.com |
At first I tried joining them by concatenating the names and using that column as the key, but then I got stuck on how to process the column and how to remove the df2 data afterwards.
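Join the dataframes with a left join on Name and Surname, but use alias on each of them. The aliases let you choose between columns that share a name. F.coalesce then takes the email from df2 when there is a match and falls back to the df1 email otherwise.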
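from pyspark.sql import functions as F

# Left join keeps every df1 row; df2 columns are null where there is no match
df3 = (df1.alias('a')
    .join(df2.alias('b'), ['Name', 'Surname'], 'left')
    .select(
        'Name',
        'Surname',
        # Prefer the df2 email, fall back to the df1 email
        F.coalesce('b.email', 'a.email').alias('email')
    )
)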
df3.show()
# +-------+-----------+--------------------+
# | Name| Surname| email|
# +-------+-----------+--------------------+
# | Anna| Hendrix| AHendrix@ymail.com|
# | Jake| Smith| JakeSmith@gmail.com|
# | John| Smith| JSmith@ymail.com|
# |Patrick| Huston|Huston1990@yahoom...|
# | George| Hiddleston| GH@tonmail.com|
# | Kale|Kinderstone|KKinderstone@ymai...|
# +-------+-----------+--------------------+
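To make the row-by-row logic of the left join plus coalesce easier to follow, here is a minimal plain-Python sketch of the same idea, without Spark. The helper name merge_emails and the tuple layout are assumptions made for illustration only; they are not part of PySpark, and the sketch assumes each (Name, Surname) pair appears at most once in df2.

```python
# Plain-Python model of "left join on (Name, Surname), then coalesce the emails".
# merge_emails is a hypothetical helper for illustration, not a PySpark function.

def merge_emails(left_rows, right_rows):
    """Return left_rows with the email replaced wherever the key also exists in right_rows."""
    # Index the right side by the join key.
    # Assumption: keys are unique in right_rows (a later duplicate would overwrite an earlier one).
    replacement = {(name, surname): email for name, surname, email in right_rows}

    merged = []
    for name, surname, email in left_rows:
        new_email = replacement.get((name, surname))  # None when there is no match
        # Same idea as coalesce: take the first value that is not None.
        merged.append((name, surname, new_email if new_email is not None else email))
    return merged


df1_rows = [
    ("John", "Smith", "JohnSmith@gmail.com"),
    ("Jake", "Smith", "JakeSmith@gmail.com"),
    ("Anna", "Hendrix", "Anna1994@protonmail.com"),
    ("Kale", "Kinderstone", "Kinder@hotmail.com"),
    ("George", "Hiddleston", "GH@tonmail.com"),
    ("Patrick", "Huston", "Huston1990@yahoomail.com"),
]
df2_rows = [
    ("John", "Smith", "JSmith@ymail.com"),
    ("Hannah", "Montana", "HMontana@ymail.com"),
    ("Anna", "Hendrix", "AHendrix@ymail.com"),
    ("Kale", "Kinderstone", "KKinderstone@ymail.com"),
    ("Ivan", "Gaganovitch", "IG@ymail.com"),
    ("Florence", "Jekins", "FJekins@ymail.com"),
]

df3_rows = merge_emails(df1_rows, df2_rows)
for row in df3_rows:
    print(row)
```

Rows that exist only in df2 (Hannah, Ivan, Florence) never reach the result, because the loop is driven by the df1 rows, just as a left join is driven by its left side. One difference to keep in mind: if df2 contained the same (Name, Surname) pair more than once, Spark's left join would produce one output row per match, so df2 would need deduplicating first.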