pyspark - For each key value, how to get first and last notNull value (based on timestamp) for all other columns
I have a pyspark dataframe with values like the ones below. I want the first/last notNull values of stateID, together with their corresponding timestamps, in 4 separate columns, and the same for countryID. As described at the bottom, the output will have 9 columns in total. Please help me do this (in pyspark or SQL) with good performance: the real table is huge, with millions of rows and 10 columns for which I need the first/last values and their timestamps.
empId  timestamp           stateID  countryID
1      5/1/2022 10:10am    CA       null
1      null                CA       USA
1      5/2/2022 11:11pm    CT       USA
1      null                NJ       USA
1      5/10/2022 12:12pm   null     UK
2      null                VA       USA
2      5/9/2022 12:15am    TX       null
2      5/10/2022 09:09am   CA       USA
3      null                NY       USA
3      5/16/2022 09:15pm   MO       Japan
3      5/17/2022 04:04am   AL       USA
3      5/20/2022 07:07pm   null     UK
I don't think this is the most optimized solution, but it at least uses Spark functions; with the code below you will get the required output for the data provided in your image.
from pyspark.sql import Window
from pyspark.sql.functions import col, first, to_timestamp, last, when  # required imports

spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")  # needed on Spark 3 so the am/pm pattern below parses

te = spark.read.csv(path="<FilePath>", header=True, inferSchema=True)  # replace <FilePath> with your file path
# drop rows with an empty timestamp and convert the string to a real timestamp
te = te.dropna(how="all", subset=["timestamp"]) \
       .withColumn("timestamp1", to_timestamp(col("timestamp"), "MM/dd/yyyy hh:mma"))

w = Window.partitionBy("empId").orderBy("timestamp1")  # window used in the calculation step

# calculation step: running first/last non-null values per employee, plus the
# timestamps of the rows on which those values occur
te2 = te.withColumn("first_state", first("stateId", ignorenulls=True).over(w)) \
    .withColumn("last_state", last("stateId", ignorenulls=True).over(w)) \
    .withColumn("first_country", first("countryID", ignorenulls=True).over(w)) \
    .withColumn("last_country", last("countryID", ignorenulls=True).over(w)) \
    .withColumn("last_country_timestamp", when(col("countryId") == col("last_country"), col("timestamp1"))) \
    .withColumn("first_country_timestamp", when(col("countryId") == col("first_country"), col("timestamp1"))) \
    .withColumn("last_state_timestamp", when(col("stateId") == col("last_state"), col("timestamp1"))) \
    .withColumn("first_state_timestamp", when(col("stateId") == col("first_state"), col("timestamp1"))) \
    .drop("timestamp", "stateId", "countryId")

# final df with the required 9 columns
te3 = te2.groupBy("empId").agg(
    first("first_state", ignorenulls=True).alias("first_state"),
    first("first_state_timestamp", ignorenulls=True).alias("first_stateID_timestamp"),
    last("last_state", ignorenulls=True).alias("last_state"),
    last("last_state_timestamp", ignorenulls=True).alias("last_stateID_timestamp"),
    first("first_country", ignorenulls=True).alias("first_country"),
    first("first_country_timestamp", ignorenulls=True).alias("first_countryID_timestamp"),
    last("last_country", ignorenulls=True).alias("last_country"),
    last("last_country_timestamp", ignorenulls=True).alias("last_countryID_timestamp"))
Below is the output you will get in te3 for the data you provided:
+-----+-----------+-----------------------+----------+----------------------+-------------+-------------------------+------------+------------------------+
|empId|first_state|first_stateID_timestamp|last_state|last_stateID_timestamp|first_country|first_countryID_timestamp|last_country|last_countryID_timestamp|
+-----+-----------+-----------------------+----------+----------------------+-------------+-------------------------+------------+------------------------+
| 1| CA| 2022-05-01 10:10:00| CT| 2022-05-02 23:11:00| USA| 2022-05-02 23:11:00| UK| 2022-05-10 12:12:00|
| 2| TX| 2022-05-09 00:15:00| CA| 2022-05-10 09:09:00| USA| 2022-05-10 09:09:00| USA| 2022-05-10 09:09:00|
| 3| MO| 2022-05-16 21:15:00| AL| 2022-05-17 04:04:00| Japan| 2022-05-16 21:15:00| UK| 2022-05-20 19:07:00|
+-----+-----------+-----------------------+----------+----------------------+-------------+-------------------------+------------+------------------------+
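As a sanity check, the same "first/last non-null value per empId, ordered by timestamp" semantics can be sketched in plain Python without Spark. This is only an illustration of the logic (the row tuples below re-encode the sample table with None for the nulls, and the helper name first_last is made up for this sketch), not the answer's actual code path:

```python
from datetime import datetime

# Sample rows (None marks the nulls in the question's table); timestamps
# use the same MM/dd/yyyy hh:mma format as the question.
rows = [
    (1, "5/1/2022 10:10am", "CA", None),
    (1, None, "CA", "USA"),
    (1, "5/2/2022 11:11pm", "CT", "USA"),
    (1, None, "NJ", "USA"),
    (1, "5/10/2022 12:12pm", None, "UK"),
    (2, None, "VA", "USA"),
    (2, "5/9/2022 12:15am", "TX", None),
    (2, "5/10/2022 09:09am", "CA", "USA"),
    (3, None, "NY", "USA"),
    (3, "5/16/2022 09:15pm", "MO", "Japan"),
    (3, "5/17/2022 04:04am", "AL", "USA"),
    (3, "5/20/2022 07:07pm", None, "UK"),
]

def parse(ts):
    # %I:%M%p parses 12-hour times; strptime accepts lowercase am/pm
    return datetime.strptime(ts, "%m/%d/%Y %I:%M%p") if ts else None

def first_last(rows, col):
    """First/last non-null value of column index `col` (2=stateID,
    3=countryID) per empId, keeping only rows that have a timestamp,
    as (timestamp, value) pairs."""
    out = {}
    for emp_id, ts, state, country in rows:
        t = parse(ts)
        v = (state, country)[col - 2]
        if t is None or v is None:
            continue  # mirrors dropna on timestamp / ignorenulls on value
        rec = out.setdefault(emp_id, {"first": (t, v), "last": (t, v)})
        if t < rec["first"][0]:
            rec["first"] = (t, v)
        if t > rec["last"][0]:
            rec["last"] = (t, v)
    return out

states = first_last(rows, 2)
countries = first_last(rows, 3)
# states[1] -> first CA at 5/1 10:10, last CT at 5/2 23:11
# countries[3] -> first Japan at 5/16 21:15, last UK at 5/20 19:07
```

The results match the te3 output above, which is a quick way to convince yourself the window/groupBy logic does what you want before running it on the millions-of-rows table.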