How to process a dataframe column based on another dataframe column in PySpark?
Let's say that I have these two dataframes:
df1:
Name     Surname      email
John     Smith        JohnSmith@gmail.com
Jake     Smith        JakeSmith@gmail.com
Anna     Hendrix      Anna1994@protonmail.com
Kale     Kinderstone  Kinder@hotmail.com
George   Hiddleston   GH@tonmail.com
Patrick  Huston       Huston1990@yahoomail.com
df2:
Name      Surname      email
John      Smith        JSmith@ymail.com
Hannah    Montana      HMontana@ymail.com
Anna      Hendrix      AHendrix@ymail.com
Kale      Kinderstone  KKinderstone@ymail.com
Ivan      Gaganovitch  IG@ymail.com
Florence  Jekins       FJekins@ymail.com
What I want to do is replace some particular emails without touching the rest of the data. So, the final product I want is:
df3:
Name     Surname      email
John     Smith        JSmith@ymail.com
Jake     Smith        JakeSmith@gmail.com
Anna     Hendrix      AHendrix@ymail.com
Kale     Kinderstone  KKinderstone@ymail.com
George   Hiddleston   GH@tonmail.com
Patrick  Huston       Huston1990@yahoomail.com
At first, I tried joining them by concatenating the names and using that column as the key, but then I got stuck on how to process the email column and how to remove the leftover df2 data afterwards.
Join the dataframes, but give each one an alias. Then you will be able to choose between columns of the same name.
from pyspark.sql import functions as F

df3 = (df1.alias('a')
       .join(df2.alias('b'), ['Name', 'Surname'], 'left')
       .select(
           'Name',
           'Surname',
           # Prefer df2's email when the join matched; otherwise keep df1's.
           F.coalesce('b.email', 'a.email').alias('email')
       )
      )
df3.show()
# +-------+-----------+--------------------+
# | Name| Surname| email|
# +-------+-----------+--------------------+
# | Anna| Hendrix| AHendrix@ymail.com|
# | Jake| Smith| JakeSmith@gmail.com|
# | John| Smith| JSmith@ymail.com|
# |Patrick| Huston|Huston1990@yahoom...|
# | George| Hiddleston| GH@tonmail.com|
# | Kale|Kinderstone|KKinderstone@ymai...|
# +-------+-----------+--------------------+
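If it helps to see what the left join plus `coalesce` is doing, here is a minimal plain-Python sketch of the same semantics (no Spark required): for each row of df1, take df2's email when the `(Name, Surname)` key exists in df2, otherwise keep df1's email. The variable names mirror the dataframes above but are just lists of tuples here.

```python
# Rows as (Name, Surname, email) tuples, matching the tables above.
df1 = [
    ("John", "Smith", "JohnSmith@gmail.com"),
    ("Jake", "Smith", "JakeSmith@gmail.com"),
    ("Anna", "Hendrix", "Anna1994@protonmail.com"),
    ("Kale", "Kinderstone", "Kinder@hotmail.com"),
    ("George", "Hiddleston", "GH@tonmail.com"),
    ("Patrick", "Huston", "Huston1990@yahoomail.com"),
]
df2 = [
    ("John", "Smith", "JSmith@ymail.com"),
    ("Hannah", "Montana", "HMontana@ymail.com"),
    ("Anna", "Hendrix", "AHendrix@ymail.com"),
    ("Kale", "Kinderstone", "KKinderstone@ymail.com"),
    ("Ivan", "Gaganovitch", "IG@ymail.com"),
    ("Florence", "Jekins", "FJekins@ymail.com"),
]

# Build a lookup keyed on (Name, Surname) -- the join keys.
overrides = {(n, s): e for n, s, e in df2}

# "Left join" df1 against the lookup and "coalesce" the two email columns:
# the df2 email wins when present, else the df1 email survives unchanged.
df3 = [(n, s, overrides.get((n, s), e)) for n, s, e in df1]

for row in df3:
    print(row)
```

Note that df2 rows with no match in df1 (Hannah, Ivan, Florence) simply drop out, which is exactly what the `'left'` join type gives you in Spark.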