
pyspark - For each key value, how to get first and last notNull value (based on timestamp) for all other columns

I have a pyspark dataframe with values like this. I want to get the first/last non-null value of state and their corresponding timestamps in 4 separate columns, and do the same for the country column. The output will have 9 columns in total, as shown at the bottom. Please help me do this in PySpark or SQL with good performance: I have a HUGE table with millions of rows and 10 columns for which I need these first/last values and their timestamps.

empId   timestamp      stateID  countryID
1   5/1/2022 10:10am    CA  
1                       CA        USA
1   5/2/2022 11:11pm    CT        USA
1                       NJ        USA
1   5/10/2022 12:12pm             UK
2                       VA        USA
2   5/9/2022 12:15am    TX  
2   5/10/2022 09:09am   CA        USA
3                       NY        USA
3   5/16/2022 09:15pm   MO        Japan
3   5/17/2022 04:04am   AL        USA
3   5/20/2022 07:07pm             UK


I don't think this is the most optimized solution, but it at least sticks to Spark functions. The code below produces the desired output (as in your image) for the data you provided:

import pyspark.sql.functions as F
from pyspark.sql import Window  # required imports

spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")  # for Spark 3

te = spark.read.csv(path="<FilePath>", header=True, inferSchema=True)  # replace filepath

# drop rows with an empty timestamp and convert the string to a timestamp
te = te.dropna(how="all", subset=["timestamp"]) \
       .withColumn("timestamp1", F.to_timestamp(F.col("timestamp"), "MM/dd/yyyy hh:mma"))

# whole-partition window: every row sees the first/last non-null value for its empId
w = Window.partitionBy("empId").orderBy("timestamp1") \
          .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

te2 = te.withColumn("first_state", F.first("stateID", ignorenulls=True).over(w)) \
        .withColumn("last_state", F.last("stateID", ignorenulls=True).over(w)) \
        .withColumn("first_country", F.first("countryID", ignorenulls=True).over(w)) \
        .withColumn("last_country", F.last("countryID", ignorenulls=True).over(w))

# collapse to one row per empId; min/max pick the matching timestamps deterministically
# (first/last inside an agg depend on row order, which is not guaranteed after a shuffle)
te3 = te2.groupBy("empId").agg(
    F.first("first_state", ignorenulls=True).alias("first_state"),
    F.min(F.when(F.col("stateID") == F.col("first_state"), F.col("timestamp1"))).alias("first_stateID_timestamp"),
    F.first("last_state", ignorenulls=True).alias("last_state"),
    F.max(F.when(F.col("stateID") == F.col("last_state"), F.col("timestamp1"))).alias("last_stateID_timestamp"),
    F.first("first_country", ignorenulls=True).alias("first_country"),
    F.min(F.when(F.col("countryID") == F.col("first_country"), F.col("timestamp1"))).alias("first_countryID_timestamp"),
    F.first("last_country", ignorenulls=True).alias("last_country"),
    F.max(F.when(F.col("countryID") == F.col("last_country"), F.col("timestamp1"))).alias("last_countryID_timestamp"),
)  # final df with the required output

Below is the output you will get in te3 for the data you provided:

+-----+-----------+-----------------------+----------+----------------------+-------------+-------------------------+------------+------------------------+
|empId|first_state|first_stateID_timestamp|last_state|last_stateID_timestamp|first_country|first_countryID_timestamp|last_country|last_countryID_timestamp|
+-----+-----------+-----------------------+----------+----------------------+-------------+-------------------------+------------+------------------------+
|    1|         CA|    2022-05-01 10:10:00|        CT|   2022-05-02 23:11:00|          USA|      2022-05-02 23:11:00|          UK|     2022-05-10 12:12:00|
|    2|         TX|    2022-05-09 00:15:00|        CA|   2022-05-10 09:09:00|          USA|      2022-05-10 09:09:00|         USA|     2022-05-10 09:09:00|
|    3|         MO|    2022-05-16 21:15:00|        AL|   2022-05-17 04:04:00|        Japan|      2022-05-16 21:15:00|          UK|     2022-05-20 19:07:00|
+-----+-----------+-----------------------+----------+----------------------+-------------+-------------------------+------------+------------------------+
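If you want to sanity-check the first/last-non-null logic without a Spark cluster, the same rule can be sketched in plain Python on the question's sample rows. This is only a cross-check, not part of the Spark solution; the `first_last` helper and its dict layout are made up for illustration, and `%m/%d/%Y %I:%M%p` is the `strptime` equivalent of the Spark pattern `MM/dd/yyyy hh:mma`:

```python
from datetime import datetime

# Sample rows from the question: (empId, timestamp, stateID, countryID);
# None marks a missing value.
rows = [
    (1, "5/1/2022 10:10am", "CA", None),
    (1, None, "CA", "USA"),
    (1, "5/2/2022 11:11pm", "CT", "USA"),
    (1, None, "NJ", "USA"),
    (1, "5/10/2022 12:12pm", None, "UK"),
    (2, None, "VA", "USA"),
    (2, "5/9/2022 12:15am", "TX", None),
    (2, "5/10/2022 09:09am", "CA", "USA"),
    (3, None, "NY", "USA"),
    (3, "5/16/2022 09:15pm", "MO", "Japan"),
    (3, "5/17/2022 04:04am", "AL", "USA"),
    (3, "5/20/2022 07:07pm", None, "UK"),
]

def first_last(rows, col):
    """Per empId: first/last non-null value of `col` by timestamp, with the timestamp."""
    out = {}
    for emp_id, ts, state, country in rows:
        if ts is None:          # mirrors dropping rows with an empty timestamp
            continue
        val = state if col == "state" else country
        if val is None:         # mirrors ignorenulls=True
            continue
        t = datetime.strptime(ts, "%m/%d/%Y %I:%M%p")
        entry = out.setdefault(emp_id, {"first": (t, val), "last": (t, val)})
        if t < entry["first"][0]:
            entry["first"] = (t, val)
        if t >= entry["last"][0]:
            entry["last"] = (t, val)
    return out

states = first_last(rows, "state")
countries = first_last(rows, "country")
print(states[1]["first"])   # (datetime(2022, 5, 1, 10, 10), 'CA')
print(states[1]["last"])    # (datetime(2022, 5, 2, 23, 11), 'CT')
print(countries[3]["last"]) # (datetime(2022, 5, 20, 19, 7), 'UK')
```

The printed values agree with the first/last state and country columns for empId 1 and 3 in the table above.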

