I have a PySpark dataframe with values like the ones below. I want to get the first and last non-null values of the state column, together with their corresponding timestamps, as four separate columns, and do the same for the country column. The output should have 9 columns in total, as shown at the bottom. How can I do this in PySpark or SQL with good performance? I have a huge table with millions of rows and 10 columns for which I need these first/last values and their timestamps.
empId  timestamp           stateID  countryID
1      5/1/2022 10:10am    CA
1                          CA       USA
1      5/2/2022 11:11pm    CT       USA
1                          NJ       USA
1      5/10/2022 12:12pm            UK
2                          VA       USA
2      5/9/2022 12:15am    TX
2      5/10/2022 09:09am   CA       USA
3                          NY       USA
3      5/16/2022 09:15pm   MO       Japan
3      5/17/2022 04:04am   AL       USA
3      5/20/2022 07:07pm            UK
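To make the expected output concrete, here is a plain-Python sketch of the semantics I'm after (I'm assuming rows without a timestamp can be ignored, since they cannot be ordered):

```python
from datetime import datetime

# Sample rows as (empId, timestamp, stateID, countryID); None marks a blank cell.
rows = [
    (1, "5/1/2022 10:10am", "CA", None),
    (1, None, "CA", "USA"),
    (1, "5/2/2022 11:11pm", "CT", "USA"),
    (1, None, "NJ", "USA"),
    (1, "5/10/2022 12:12pm", None, "UK"),
    (2, None, "VA", "USA"),
    (2, "5/9/2022 12:15am", "TX", None),
    (2, "5/10/2022 09:09am", "CA", "USA"),
    (3, None, "NY", "USA"),
    (3, "5/16/2022 09:15pm", "MO", "Japan"),
    (3, "5/17/2022 04:04am", "AL", "USA"),
    (3, "5/20/2022 07:07pm", None, "UK"),
]

def first_last_non_null(rows):
    """Per empId: first/last non-null state and country with their timestamps."""
    by_emp = {}
    for emp, ts, state, country in rows:
        if ts is None:
            continue  # a row without a timestamp cannot be ordered
        t = datetime.strptime(ts, "%m/%d/%Y %I:%M%p")
        by_emp.setdefault(emp, []).append((t, state, country))
    result = {}
    for emp, recs in sorted(by_emp.items()):
        recs.sort(key=lambda r: r[0])  # order by timestamp
        states = [(t, s) for t, s, _ in recs if s is not None]
        countries = [(t, c) for t, _, c in recs if c is not None]
        result[emp] = (
            states[0][1], states[0][0],        # first_state and its timestamp
            states[-1][1], states[-1][0],      # last_state and its timestamp
            countries[0][1], countries[0][0],  # first_country and its timestamp
            countries[-1][1], countries[-1][0] # last_country and its timestamp
        )
    return result
```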
I don't think this is the most optimized solution, but using only Spark functions, the code below produces the desired output (as in your image) for the data you provided:
from pyspark.sql import Window
from pyspark.sql.functions import col, first, last, to_timestamp, when  # required imports

spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")  # needed for Spark 3's stricter datetime parser
te = spark.read.csv(path="<FilePath>", header=True, inferSchema=True)  # replace <FilePath> with your file path
# drop rows with an empty timestamp and parse the string into a real timestamp column
te = te.dropna(how="all", subset=["timestamp"]).withColumn("timestamp1", to_timestamp(col("timestamp"), "MM/dd/yyyy hh:mma"))
w = Window.partitionBy("empId").orderBy("timestamp1")  # per-employee window, ordered by time
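One subtlety in this window: when a Spark window has an `orderBy` but no explicit frame, the frame defaults to rows between unbounded preceding and the current row, so `last(..., ignorenulls=True).over(w)` gives a *running* last non-null value rather than the partition-wide last; the `groupBy` step further down is what collapses those running values. A plain-Python analogue of the running behaviour:

```python
def running_last_non_null(values):
    """Mimics last(col, ignorenulls=True) over an ordered window with the
    default frame (unbounded preceding to current row): a running last."""
    out, current = [], None
    for v in values:
        if v is not None:
            current = v  # remember the most recent non-null value seen so far
        out.append(current)
    return out

# empId 1's stateID values, ordered by timestamp1:
print(running_last_non_null(["CA", "CT", None]))  # ['CA', 'CT', 'CT']
```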
# compute running first/last values, and tag each row's timestamp where its
# value matches the corresponding first/last value
te2 = te.withColumn("first_state", first("stateID", ignorenulls=True).over(w)) \
    .withColumn("last_state", last("stateID", ignorenulls=True).over(w)) \
    .withColumn("first_country", first("countryID", ignorenulls=True).over(w)) \
    .withColumn("last_country", last("countryID", ignorenulls=True).over(w)) \
    .withColumn("last_country_timestamp", when(col("countryID") == col("last_country"), col("timestamp1"))) \
    .withColumn("first_country_timestamp", when(col("countryID") == col("first_country"), col("timestamp1"))) \
    .withColumn("last_state_timestamp", when(col("stateID") == col("last_state"), col("timestamp1"))) \
    .withColumn("first_state_timestamp", when(col("stateID") == col("first_state"), col("timestamp1"))) \
    .drop("timestamp", "stateID", "countryID")
# collapse to one row per empId; note that first/last inside an aggregation are
# order-dependent and Spark does not guarantee row order after a shuffle, so
# this step is not strictly deterministic
te3 = te2.groupBy("empId").agg(
    first("first_state", ignorenulls=True).alias("first_state"),
    first("first_state_timestamp", ignorenulls=True).alias("first_stateID_timestamp"),
    last("last_state", ignorenulls=True).alias("last_state"),
    last("last_state_timestamp", ignorenulls=True).alias("last_stateID_timestamp"),
    first("first_country", ignorenulls=True).alias("first_country"),
    first("first_country_timestamp", ignorenulls=True).alias("first_countryID_timestamp"),
    last("last_country", ignorenulls=True).alias("last_country"),
    last("last_country_timestamp", ignorenulls=True).alias("last_countryID_timestamp"),
)  # final DataFrame with the required output
Below is the output you will get in te3 for the data you provided:
+-----+-----------+-----------------------+----------+----------------------+-------------+-------------------------+------------+------------------------+
|empId|first_state|first_stateID_timestamp|last_state|last_stateID_timestamp|first_country|first_countryID_timestamp|last_country|last_countryID_timestamp|
+-----+-----------+-----------------------+----------+----------------------+-------------+-------------------------+------------+------------------------+
| 1| CA| 2022-05-01 10:10:00| CT| 2022-05-02 23:11:00| USA| 2022-05-02 23:11:00| UK| 2022-05-10 12:12:00|
| 2| TX| 2022-05-09 00:15:00| CA| 2022-05-10 09:09:00| USA| 2022-05-10 09:09:00| USA| 2022-05-10 09:09:00|
| 3| MO| 2022-05-16 21:15:00| AL| 2022-05-17 04:04:00| Japan| 2022-05-16 21:15:00| UK| 2022-05-20 19:07:00|
+-----+-----------+-----------------------+----------+----------------------+-------------+-------------------------+------------+------------------------+
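Since the question stresses performance on millions of rows, a possible alternative worth trying (a sketch, not tested on real data, and assuming your Spark version supports min/max over struct columns) is to skip the window entirely and do a single `groupBy("empId")` with `F.min`/`F.max` over a struct such as `F.struct("timestamp1", "stateID")`, filtered to non-null values with `when(col("stateID").isNotNull(), ...)`. Spark compares structs field by field, so the minimum struct carries the earliest timestamp together with its value. The core idea in plain Python, where tuple comparison plays the role of struct ordering:

```python
from datetime import datetime

# Non-null stateID records for empId 3, as (timestamp, value) pairs; tuples
# compare element by element, so min/max order by timestamp first.
recs = [
    (datetime(2022, 5, 17, 4, 4), "AL"),
    (datetime(2022, 5, 16, 21, 15), "MO"),
]

first_ts, first_state = min(recs)  # earliest timestamp wins
last_ts, last_state = max(recs)    # latest timestamp wins
print(first_state, last_state)  # MO AL
```

This replaces a per-partition sort (the window) with a plain aggregation, which is usually cheaper on very large tables.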