How to process a dataframe column based on another dataframe column in PySpark?

Let's say that I have these two dataframes:

df1 :   | Name    | Surname     | email                    |
        | John    | Smith       | JohnSmith@gmail.com      |
        | Jake    | Smith       | JakeSmith@gmail.com      |
        | Anna    | Hendrix     | Anna1994@protonmail.com  |
        | Kale    | Kinderstone | Kinder@hotmail.com       |
        | George  | Hiddleston  | GH@tonmail.com           |
        | Patrick | Huston      | Huston1990@yahoomail.com |


df2 :   | Name     | Surname     | email                  |
        | John     | Smith       | JSmith@ymail.com       |
        | Hannah   | Montana     | HMontana@ymail.com     |
        | Anna     | Hendrix     | AHendrix@ymail.com     |
        | Kale     | Kinderstone | KKinderstone@ymail.com |
        | Ivan     | Gaganovitch | IG@ymail.com           |
        | Florence | Jekins      | FJekins@ymail.com      |

What I want to do is replace some particular emails without touching the rest of the data. So, the final product I want is:

df3 :   | Name    | Surname     | email                    |
        | John    | Smith       | JSmith@ymail.com         |
        | Jake    | Smith       | JakeSmith@gmail.com      |
        | Anna    | Hendrix     | AHendrix@ymail.com       |
        | Kale    | Kinderstone | KKinderstone@ymail.com   |
        | George  | Hiddleston  | GH@tonmail.com           |
        | Patrick | Huston      | Huston1990@yahoomail.com |

At first, I tried joining them by concatenating the names and using that column as a key, but then I got stuck on how to process the column and how to remove the df2 data afterwards.

Join the dataframes, but use an alias on each of them. Then you will be able to choose between columns of the same name.

import pyspark.sql.functions as F

df3 = (df1.alias('a')
    .join(df2.alias('b'), ['Name', 'Surname'], 'left')
    .select(
        'Name',
        'Surname',
        F.coalesce('b.email', 'a.email').alias('email')
    )
)
df3.show()
# +-------+-----------+--------------------+
# |   Name|    Surname|               email|
# +-------+-----------+--------------------+
# |   Anna|    Hendrix|  AHendrix@ymail.com|
# |   Jake|      Smith| JakeSmith@gmail.com|
# |   John|      Smith|    JSmith@ymail.com|
# |Patrick|     Huston|Huston1990@yahoom...|
# | George| Hiddleston|      GH@tonmail.com|
# |   Kale|Kinderstone|KKinderstone@ymai...|
# +-------+-----------+--------------------+
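The join behaves like a left-priority lookup: every row of df1 is kept, and df2's email wins only when the (Name, Surname) key matches; `F.coalesce` picks the first non-null value. The same semantics can be sketched in plain Python (dictionaries standing in for the dataframes, keyed on (Name, Surname) — just an illustration, not PySpark code):

```python
# Emails keyed by (Name, Surname), mirroring df1 and df2 above.
df1 = {
    ('John', 'Smith'): 'JohnSmith@gmail.com',
    ('Jake', 'Smith'): 'JakeSmith@gmail.com',
    ('Anna', 'Hendrix'): 'Anna1994@protonmail.com',
    ('Kale', 'Kinderstone'): 'Kinder@hotmail.com',
    ('George', 'Hiddleston'): 'GH@tonmail.com',
    ('Patrick', 'Huston'): 'Huston1990@yahoomail.com',
}
df2 = {
    ('John', 'Smith'): 'JSmith@ymail.com',
    ('Hannah', 'Montana'): 'HMontana@ymail.com',
    ('Anna', 'Hendrix'): 'AHendrix@ymail.com',
    ('Kale', 'Kinderstone'): 'KKinderstone@ymail.com',
    ('Ivan', 'Gaganovitch'): 'IG@ymail.com',
    ('Florence', 'Jekins'): 'FJekins@ymail.com',
}

# Left join: keep every key of df1.
# Coalesce: prefer df2's email when the key matches, else keep df1's.
df3 = {key: df2.get(key, email) for key, email in df1.items()}
```

Rows that exist only in df2 (Hannah, Ivan, Florence) never enter the result, which is exactly what the `'left'` join type guarantees.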
