
How to process a dataframe column based on another dataframe column in PySpark?

Let's say that I have these two dataframes

df1:
| Name    | Surname     | email                    |
| John    | Smith       | JohnSmith@gmail.com      |
| Jake    | Smith       | JakeSmith@gmail.com      |
| Anna    | Hendrix     | Anna1994@protonmail.com  |
| Kale    | Kinderstone | Kinder@hotmail.com       |
| George  | Hiddleston  | GH@tonmail.com           |
| Patrick | Huston      | Huston1990@yahoomail.com |


df2:
| Name     | Surname     | email                  |
| John     | Smith       | JSmith@ymail.com       |
| Hannah   | Montana     | HMontana@ymail.com     |
| Anna     | Hendrix     | AHendrix@ymail.com     |
| Kale     | Kinderstone | KKinderstone@ymail.com |
| Ivan     | Gaganovitch | IG@ymail.com           |
| Florence | Jekins      | FJekins@ymail.com      |

I want to replace certain emails in df1 with the matching emails from df2, without touching the rest of the data. The final result I want is

df3:
| Name    | Surname     | email                    |
| John    | Smith       | JSmith@ymail.com         |
| Jake    | Smith       | JakeSmith@gmail.com      |
| Anna    | Hendrix     | AHendrix@ymail.com       |
| Kale    | Kinderstone | KKinderstone@ymail.com   |
| George  | Hiddleston  | GH@tonmail.com           |
| Patrick | Huston      | Huston1990@yahoomail.com |

At first, I tried joining them by concatenating the names and using that column as the key, but then I got stuck on how to process the email column and how to drop the df2 data afterwards.

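Left-join df2 onto df1 on Name and Surname, and give each dataframe an alias so that you can tell the two same-named email columns apart. Then coalesce picks the df2 email where a match exists and falls back to the df1 email otherwise.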

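from pyspark.sql import functions as F

# The left join keeps every df1 row; rows without a match get null for b.email.
df3 = (df1.alias('a')
    .join(df2.alias('b'), ['Name', 'Surname'], 'left')
    .select(
        'Name',
        'Surname',
        # Prefer the df2 email; fall back to the df1 email when it is null.
        F.coalesce('b.email', 'a.email').alias('email')
    )
)
df3.show()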
# +-------+-----------+--------------------+
# |   Name|    Surname|               email|
# +-------+-----------+--------------------+
# |   Anna|    Hendrix|  AHendrix@ymail.com|
# |   Jake|      Smith| JakeSmith@gmail.com|
# |   John|      Smith|    JSmith@ymail.com|
# |Patrick|     Huston|Huston1990@yahoom...|
# | George| Hiddleston|      GH@tonmail.com|
# |   Kale|Kinderstone|KKinderstone@ymai...|
# +-------+-----------+--------------------+
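
To see why this works, it may help to model the left join and coalesce in plain Python, without Spark. The following is only a sketch of the row-level logic, not how Spark executes the join. The helper name replace_emails and the tuple layout are made up for illustration, and it assumes each (Name, Surname) pair appears at most once in df2 (in Spark, duplicate keys in df2 would yield duplicate rows in the result).

```python
# Plain-Python model of "left join on (Name, Surname) + coalesce(b.email, a.email)".
# Rows are (Name, Surname, email) tuples copied from the question's df1 and df2.
df1_rows = [
    ("John", "Smith", "JohnSmith@gmail.com"),
    ("Jake", "Smith", "JakeSmith@gmail.com"),
    ("Anna", "Hendrix", "Anna1994@protonmail.com"),
    ("Kale", "Kinderstone", "Kinder@hotmail.com"),
    ("George", "Hiddleston", "GH@tonmail.com"),
    ("Patrick", "Huston", "Huston1990@yahoomail.com"),
]
df2_rows = [
    ("John", "Smith", "JSmith@ymail.com"),
    ("Hannah", "Montana", "HMontana@ymail.com"),
    ("Anna", "Hendrix", "AHendrix@ymail.com"),
    ("Kale", "Kinderstone", "KKinderstone@ymail.com"),
    ("Ivan", "Gaganovitch", "IG@ymail.com"),
    ("Florence", "Jekins", "FJekins@ymail.com"),
]


def replace_emails(left_rows, right_rows):
    """Hypothetical helper: keep every left row, overriding its email on a key match."""
    # Index the right side by the join key (assumes keys are unique on the right).
    right_email = {(name, surname): email for name, surname, email in right_rows}
    result = []
    for name, surname, email in left_rows:
        # No match gives None, like the null b.email after a left join.
        new_email = right_email.get((name, surname))
        # coalesce: take the first value that is not null.
        result.append((name, surname, email if new_email is None else new_email))
    return result


df3_rows = replace_emails(df1_rows, df2_rows)
for row in df3_rows:
    print(row)
```

Rows that exist only in df2 (Hannah, Ivan, Florence) never enter the result, because the loop runs over the df1 rows only. That is the same reason the left join needs no separate step to remove the df2 data.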

