I have a PySpark dataframe with values like the ones below. I want to get the first and last non-null values of the state column, together with their corresponding timestamps, as four separate columns, and do the same for the country column. The output should have 9 columns in total, as shown at the bottom. How can I do this in PySpark or SQL with good performance? I have a huge table with millions of rows and 10 columns for which I need these first/last values and their timestamps.
empId  timestamp           stateID  countryID
1      5/1/2022 10:10am    CA
1                          CA       USA
1      5/2/2022 11:11pm    CT       USA
1                          NJ       USA
1      5/10/2022 12:12pm            UK
2                          VA       USA
2      5/9/2022 12:15am    TX
2      5/10/2022 09:09am   CA       USA
3                          NY       USA
3      5/16/2022 09:15pm   MO       Japan
3      5/17/2022 04:04am   AL       USA
3      5/20/2022 07:07pm            UK
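To make the expected output concrete, here is a plain-Python sketch of the semantics I'm after (I'm assuming rows without a timestamp can be ignored, since they cannot be ordered):

```python
from datetime import datetime

# Sample rows as (empId, timestamp, stateID, countryID); None marks a blank cell.
rows = [
    (1, "5/1/2022 10:10am", "CA", None),
    (1, None, "CA", "USA"),
    (1, "5/2/2022 11:11pm", "CT", "USA"),
    (1, None, "NJ", "USA"),
    (1, "5/10/2022 12:12pm", None, "UK"),
    (2, None, "VA", "USA"),
    (2, "5/9/2022 12:15am", "TX", None),
    (2, "5/10/2022 09:09am", "CA", "USA"),
    (3, None, "NY", "USA"),
    (3, "5/16/2022 09:15pm", "MO", "Japan"),
    (3, "5/17/2022 04:04am", "AL", "USA"),
    (3, "5/20/2022 07:07pm", None, "UK"),
]

def first_last_non_null(rows):
    """Per empId: first/last non-null state and country with their timestamps."""
    by_emp = {}
    for emp, ts, state, country in rows:
        if ts is None:
            continue  # a row without a timestamp cannot be ordered
        t = datetime.strptime(ts, "%m/%d/%Y %I:%M%p")
        by_emp.setdefault(emp, []).append((t, state, country))
    result = {}
    for emp, recs in sorted(by_emp.items()):
        recs.sort(key=lambda r: r[0])  # order by timestamp
        states = [(t, s) for t, s, _ in recs if s is not None]
        countries = [(t, c) for t, _, c in recs if c is not None]
        result[emp] = (
            states[0][1], states[0][0],        # first_state and its timestamp
            states[-1][1], states[-1][0],      # last_state and its timestamp
            countries[0][1], countries[0][0],  # first_country and its timestamp
            countries[-1][1], countries[-1][0] # last_country and its timestamp
        )
    return result
```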
I don't think this is the most optimized solution, but using only Spark functions, the code below produces the desired output (as in your image) for the data you provided:
from pyspark.sql import Window
from pyspark.sql.functions import col, first, last, to_timestamp, when  # required imports

spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")  # needed for Spark 3's stricter datetime parser
te = spark.read.csv(path="<FilePath>", header=True, inferSchema=True)  # replace <FilePath> with your file path
# drop rows with an empty timestamp and parse the string into a real timestamp column
te = te.dropna(how="all", subset=["timestamp"]).withColumn("timestamp1", to_timestamp(col("timestamp"), "MM/dd/yyyy hh:mma"))
w = Window.partitionBy("empId").orderBy("timestamp1")  # per-employee window, ordered by time
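One subtlety in this window: when a Spark window has an `orderBy` but no explicit frame, the frame defaults to rows between unbounded preceding and the current row, so `last(..., ignorenulls=True).over(w)` gives a *running* last non-null value rather than the partition-wide last; the `groupBy` step further down is what collapses those running values. A plain-Python analogue of the running behaviour:

```python
def running_last_non_null(values):
    """Mimics last(col, ignorenulls=True) over an ordered window with the
    default frame (unbounded preceding to current row): a running last."""
    out, current = [], None
    for v in values:
        if v is not None:
            current = v  # remember the most recent non-null value seen so far
        out.append(current)
    return out

# empId 1's stateID values, ordered by timestamp1:
print(running_last_non_null(["CA", "CT", None]))  # ['CA', 'CT', 'CT']
```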
# compute running first/last values, and tag each row's timestamp where its
# value matches the corresponding first/last value
te2 = te.withColumn("first_state", first("stateID", ignorenulls=True).over(w)) \
    .withColumn("last_state", last("stateID", ignorenulls=True).over(w)) \
    .withColumn("first_country", first("countryID", ignorenulls=True).over(w)) \
    .withColumn("last_country", last("countryID", ignorenulls=True).over(w)) \
    .withColumn("last_country_timestamp", when(col("countryID") == col("last_country"), col("timestamp1"))) \
    .withColumn("first_country_timestamp", when(col("countryID") == col("first_country"), col("timestamp1"))) \
    .withColumn("last_state_timestamp", when(col("stateID") == col("last_state"), col("timestamp1"))) \
    .withColumn("first_state_timestamp", when(col("stateID") == col("first_state"), col("timestamp1"))) \
    .drop("timestamp", "stateID", "countryID")
# collapse to one row per empId; note that first/last inside an aggregation are
# order-dependent and Spark does not guarantee row order after a shuffle, so
# this step is not strictly deterministic
te3 = te2.groupBy("empId").agg(
    first("first_state", ignorenulls=True).alias("first_state"),
    first("first_state_timestamp", ignorenulls=True).alias("first_stateID_timestamp"),
    last("last_state", ignorenulls=True).alias("last_state"),
    last("last_state_timestamp", ignorenulls=True).alias("last_stateID_timestamp"),
    first("first_country", ignorenulls=True).alias("first_country"),
    first("first_country_timestamp", ignorenulls=True).alias("first_countryID_timestamp"),
    last("last_country", ignorenulls=True).alias("last_country"),
    last("last_country_timestamp", ignorenulls=True).alias("last_countryID_timestamp"),
)  # final DataFrame with the required output
Below is the output you will get in te3 for the data you provided:
+-----+-----------+-----------------------+----------+----------------------+-------------+-------------------------+------------+------------------------+
|empId|first_state|first_stateID_timestamp|last_state|last_stateID_timestamp|first_country|first_countryID_timestamp|last_country|last_countryID_timestamp|
+-----+-----------+-----------------------+----------+----------------------+-------------+-------------------------+------------+------------------------+
| 1| CA| 2022-05-01 10:10:00| CT| 2022-05-02 23:11:00| USA| 2022-05-02 23:11:00| UK| 2022-05-10 12:12:00|
| 2| TX| 2022-05-09 00:15:00| CA| 2022-05-10 09:09:00| USA| 2022-05-10 09:09:00| USA| 2022-05-10 09:09:00|
| 3| MO| 2022-05-16 21:15:00| AL| 2022-05-17 04:04:00| Japan| 2022-05-16 21:15:00| UK| 2022-05-20 19:07:00|
+-----+-----------+-----------------------+----------+----------------------+-------------+-------------------------+------------+------------------------+
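Since the question stresses performance on millions of rows, a possible alternative worth trying (a sketch, not tested on real data, and assuming your Spark version supports min/max over struct columns) is to skip the window entirely and do a single `groupBy("empId")` with `F.min`/`F.max` over a struct such as `F.struct("timestamp1", "stateID")`, filtered to non-null values with `when(col("stateID").isNotNull(), ...)`. Spark compares structs field by field, so the minimum struct carries the earliest timestamp together with its value. The core idea in plain Python, where tuple comparison plays the role of struct ordering:

```python
from datetime import datetime

# Non-null stateID records for empId 3, as (timestamp, value) pairs; tuples
# compare element by element, so min/max order by timestamp first.
recs = [
    (datetime(2022, 5, 17, 4, 4), "AL"),
    (datetime(2022, 5, 16, 21, 15), "MO"),
]

first_ts, first_state = min(recs)  # earliest timestamp wins
last_ts, last_state = max(recs)    # latest timestamp wins
print(first_state, last_state)  # MO AL
```

This replaces a per-partition sort (the window) with a plain aggregation, which is usually cheaper on very large tables.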