Assigning date values to nulls in a column of a pyspark dataframe

Question

I have a pyspark dataframe:

Location        Month       New_Date    Sales
USA             1/1/2020    1/1/2020    34.56%
COL             1/1/2020    1/1/2020    66.4%
AUS             1/1/2020    1/1/2020    32.98%
NZ              null        null        44.59%
CHN             null        null        21.13%

I create the New_Date column from the Month column (MM/dd/yyyy format). I need to fill in the New_date value for the rows where Month is null.

This is what I tried:

from pyspark.sql.functions import col, current_date, trunc

df1 = df.filter(col('Month').isNull()) \
    .withColumn("current_date", current_date()) \
    .withColumn("New_date", trunc(col("current_date"), "month"))

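But this gives me the first day of the current month. I need the first date from the Month column instead. Please suggest another approach.

Expected output: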

Location        Month       New_Date    Sales
USA             1/1/2020    1/1/2020    34.56%
COL             1/1/2020    1/1/2020    66.4%
AUS             1/1/2020    1/1/2020    32.98%
NZ              null        1/1/2020    44.59%
CHN             null        1/1/2020    21.13%

Answer 1

You can use the first function over a window:

from pyspark.sql import functions as F, Window

w = (Window.orderBy("Month")
     .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
     )

df1 = df.withColumn(
    "New_date",
    F.coalesce(F.col("Month"), F.first("Month", ignorenulls=True).over(w))
)

df1.show()
#+--------+--------+--------+------+
#|Location|   Month|New_date| Sales|
#+--------+--------+--------+------+
#|      NZ|    null|1/1/2020|44.59%|
#|     CHN|    null|1/1/2020|21.13%|
#|     USA|1/1/2020|1/1/2020|34.56%|
#|     COL|1/1/2020|1/1/2020| 66.4%|
#|     AUS|1/1/2020|1/1/2020|32.98%|
#+--------+--------+--------+------+
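
To make the logic of coalesce combined with first(..., ignorenulls=True) easier to follow, here is a minimal plain-Python sketch of what the expression computes for each row. It is not Spark code: the helper names first_non_null and fill_new_date are hypothetical, and "first" here means first in list order, whereas in the Spark version the window's ordering decides which non-null value comes first.

```python
# Sample rows mirroring the question's data: (Location, Month, Sales).
rows = [
    ("USA", "1/1/2020", "34.56%"),
    ("COL", "1/1/2020", "66.4%"),
    ("AUS", "1/1/2020", "32.98%"),
    ("NZ", None, "44.59%"),
    ("CHN", None, "21.13%"),
]


def first_non_null(values):
    """Return the first value that is not None, or None if there is none.

    Hypothetical helper modelling first(..., ignorenulls=True).
    """
    return next((v for v in values if v is not None), None)


def fill_new_date(rows):
    """Return rows as (Location, Month, New_date, Sales).

    New_date is the row's own Month when present, otherwise the
    first non-null Month found across all rows (the coalesce step).
    """
    fallback = first_non_null(month for _, month, _ in rows)
    return [
        (loc, month, month if month is not None else fallback, sales)
        for loc, month, sales in rows
    ]


filled = fill_new_date(rows)
for row in filled:
    print(row)
```

Because every non-null Month in the sample is the same value, the choice of "first" does not matter here; with mixed months, the ordering used in the window would determine the fill value.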
