Left join and apply CASE logic on PySpark dataframes

Question

I am learning PySpark. I am able to join two dataframes by building SQL-like views on top of them using .createOrReplaceTempView(), and I get the output I want. However, I want to learn how to do the same by operating directly on the dataframes instead of creating views.

This is my code:

    # Register both dataframes as temporary views so they can be queried with SQL
    df1.createOrReplaceTempView('left_table')
    df2.createOrReplaceTempView('right_table')

    spark.sql('''
    select
    l.*,
    CASE WHEN r.id IS NULL THEN current_timestamp() ELSE r.timestamp END AS ts
    from
    left_table l
    left join
    right_table r
    on l.id = r.id
    ''').show()

For a matching id, I want the timestamp column to be taken from the right table. For an id that is available only in the left table, I want to use the system timestamp from current_timestamp() as the final column value.

How do I achieve this by operating directly on dataframes df1 and df2 instead of building views?

Answer 1

You can do a left join and then coalesce the NULL timestamps with the current timestamp:

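    import pyspark.sql.functions as F

    # Joining on the column name 'id' keeps a single id column in the result.
    # Drop every other df2 column except 'timestamp'; 'id' must be kept explicitly,
    # otherwise the join key itself would be dropped.
    df1.join(df2, 'id', 'left') \
       .drop(*[col for col in df2.columns if col not in ('id', 'timestamp')]) \
       .withColumn('timestamp', F.coalesce(F.col('timestamp'), F.current_timestamp()))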

Answer 1, 2020-11-26 18:01:43
