Left Join and apply case logic on PySpark Dataframes
I am learning PySpark. I am able to join two dataframes by building SQL-like views on top of them using .createOrReplaceTempView() and get the output I want. However, I want to learn how to do the same by operating directly on the dataframes instead of creating views.
This is my code:
df1.createOrReplaceTempView('left_table')
df2.createOrReplaceTempView('right_table')
spark.sql('''
select
l.*,
CASE WHEN r.id IS NULL THEN current_timestamp() ELSE r.timestamp END ts
from
left_table l
left join
right_table r
on l.id = r.id
''').show()
For a matching id I want the timestamp column to be taken from the right table. For an id that is available only in the left table, I want to use the system timestamp from current_timestamp() as the final column value.
How do I achieve this by operating directly on the dataframes df1 and df2 instead of building views?
You can do a left join and then coalesce the NULL timestamps with the current timestamp:
import pyspark.sql.functions as F

result = (
    df1.join(df2, 'id', 'left')
       # drop the extra columns that came from df2, keeping the join key and timestamp
       .drop(*[c for c in df2.columns if c not in ('id', 'timestamp')])
       # fall back to the current timestamp for rows with no match in df2
       .withColumn('timestamp', F.coalesce(F.col('timestamp'), F.current_timestamp()))
)