How to process a dataframe column based on another dataframe column in PySpark?
Let's say that I have these two dataframes:
df1:
Name     Surname      email
John     Smith        JohnSmith@gmail.com
Jake     Smith        JakeSmith@gmail.com
Anna     Hendrix      Anna1994@protonmail.com
Kale     Kinderstone  Kinder@hotmail.com
George   Hiddleston   GH@tonmail.com
Patrick  Huston       Huston1990@yahoomail.com
df2:
Name      Surname      email
John      Smith        JSmith@ymail.com
Hannah    Montana      HMontana@ymail.com
Anna      Hendrix      AHendrix@ymail.com
Kale      Kinderstone  KKinderstone@ymail.com
Ivan      Gaganovitch  IG@ymail.com
Florence  Jekins       FJekins@ymail.com
What I want to do is replace some particular emails without touching the rest of the data. So, the final product I want is:
df3:
Name     Surname      email
John     Smith        JSmith@ymail.com
Jake     Smith        JakeSmith@gmail.com
Anna     Hendrix      AHendrix@ymail.com
Kale     Kinderstone  KKinderstone@ymail.com
George   Hiddleston   GH@tonmail.com
Patrick  Huston       Huston1990@yahoomail.com
At first, I tried joining them by concatenating the names and using that column as the key, but then I got stuck on how to process the email column and how to remove the leftover df2 data afterwards.
Join the dataframes, but give each one an alias. Then you will be able to choose between columns of the same name.
from pyspark.sql import functions as F

df3 = (df1.alias('a')
       .join(df2.alias('b'), ['Name', 'Surname'], 'left')
       .select(
           'Name',
           'Surname',
           # Prefer df2's email when the join matched; otherwise keep df1's.
           F.coalesce('b.email', 'a.email').alias('email')
       )
      )
df3.show()
# +-------+-----------+--------------------+
# | Name| Surname| email|
# +-------+-----------+--------------------+
# | Anna| Hendrix| AHendrix@ymail.com|
# | Jake| Smith| JakeSmith@gmail.com|
# | John| Smith| JSmith@ymail.com|
# |Patrick| Huston|Huston1990@yahoom...|
# | George| Hiddleston| GH@tonmail.com|
# | Kale|Kinderstone|KKinderstone@ymai...|
# +-------+-----------+--------------------+
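If it helps to see what the left join plus `coalesce` is doing, here is a minimal plain-Python sketch of the same semantics (no Spark required): for each row of df1, take df2's email when the `(Name, Surname)` key exists in df2, otherwise keep df1's email. The variable names mirror the dataframes above but are just lists of tuples here.

```python
# Rows as (Name, Surname, email) tuples, matching the tables above.
df1 = [
    ("John", "Smith", "JohnSmith@gmail.com"),
    ("Jake", "Smith", "JakeSmith@gmail.com"),
    ("Anna", "Hendrix", "Anna1994@protonmail.com"),
    ("Kale", "Kinderstone", "Kinder@hotmail.com"),
    ("George", "Hiddleston", "GH@tonmail.com"),
    ("Patrick", "Huston", "Huston1990@yahoomail.com"),
]
df2 = [
    ("John", "Smith", "JSmith@ymail.com"),
    ("Hannah", "Montana", "HMontana@ymail.com"),
    ("Anna", "Hendrix", "AHendrix@ymail.com"),
    ("Kale", "Kinderstone", "KKinderstone@ymail.com"),
    ("Ivan", "Gaganovitch", "IG@ymail.com"),
    ("Florence", "Jekins", "FJekins@ymail.com"),
]

# Build a lookup keyed on (Name, Surname) -- the join keys.
overrides = {(n, s): e for n, s, e in df2}

# "Left join" df1 against the lookup and "coalesce" the two email columns:
# the df2 email wins when present, else the df1 email survives unchanged.
df3 = [(n, s, overrides.get((n, s), e)) for n, s, e in df1]

for row in df3:
    print(row)
```

Note that df2 rows with no match in df1 (Hannah, Ivan, Florence) simply drop out, which is exactly what the `'left'` join type gives you in Spark.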