pyspark - For each key value, how to get first and last notNull value (based on timestamp) for all other columns
I have a pyspark dataframe with values like the ones below. I want the first/last notNull values of stateID, together with their corresponding timestamps, in 4 separate columns, and the same for countryID. As described at the bottom, the output will have 9 columns in total. Please help me do this (in pyspark or SQL) with good performance: the real table is huge, with millions of rows and 10 columns for which I need the first/last values and their timestamps.
empId  timestamp           stateID  countryID
1      5/1/2022 10:10am    CA       null
1      null                CA       USA
1      5/2/2022 11:11pm    CT       USA
1      null                NJ       USA
1      5/10/2022 12:12pm   null     UK
2      null                VA       USA
2      5/9/2022 12:15am    TX       null
2      5/10/2022 09:09am   CA       USA
3      null                NY       USA
3      5/16/2022 09:15pm   MO       Japan
3      5/17/2022 04:04am   AL       USA
3      5/20/2022 07:07pm   null     UK
I don't think this is the most optimized solution, but it at least uses Spark functions; with the code below you will get the required output for the data provided in your image.
from pyspark.sql import Window
from pyspark.sql.functions import col, first, to_timestamp, last, when  # required imports

spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")  # needed on Spark 3 so the am/pm pattern below parses

te = spark.read.csv(path="<FilePath>", header=True, inferSchema=True)  # replace <FilePath> with your file path
# drop rows with an empty timestamp and convert the string to a real timestamp
te = te.dropna(how="all", subset=["timestamp"]) \
       .withColumn("timestamp1", to_timestamp(col("timestamp"), "MM/dd/yyyy hh:mma"))

w = Window.partitionBy("empId").orderBy("timestamp1")  # window used in the calculation step

# calculation step: running first/last non-null values per employee, plus the
# timestamps of the rows on which those values occur
te2 = te.withColumn("first_state", first("stateId", ignorenulls=True).over(w)) \
    .withColumn("last_state", last("stateId", ignorenulls=True).over(w)) \
    .withColumn("first_country", first("countryID", ignorenulls=True).over(w)) \
    .withColumn("last_country", last("countryID", ignorenulls=True).over(w)) \
    .withColumn("last_country_timestamp", when(col("countryId") == col("last_country"), col("timestamp1"))) \
    .withColumn("first_country_timestamp", when(col("countryId") == col("first_country"), col("timestamp1"))) \
    .withColumn("last_state_timestamp", when(col("stateId") == col("last_state"), col("timestamp1"))) \
    .withColumn("first_state_timestamp", when(col("stateId") == col("first_state"), col("timestamp1"))) \
    .drop("timestamp", "stateId", "countryId")

# final df with the required 9 columns
te3 = te2.groupBy("empId").agg(
    first("first_state", ignorenulls=True).alias("first_state"),
    first("first_state_timestamp", ignorenulls=True).alias("first_stateID_timestamp"),
    last("last_state", ignorenulls=True).alias("last_state"),
    last("last_state_timestamp", ignorenulls=True).alias("last_stateID_timestamp"),
    first("first_country", ignorenulls=True).alias("first_country"),
    first("first_country_timestamp", ignorenulls=True).alias("first_countryID_timestamp"),
    last("last_country", ignorenulls=True).alias("last_country"),
    last("last_country_timestamp", ignorenulls=True).alias("last_countryID_timestamp"))
Below is the output you will get in te3 for the data you provided:
+-----+-----------+-----------------------+----------+----------------------+-------------+-------------------------+------------+------------------------+
|empId|first_state|first_stateID_timestamp|last_state|last_stateID_timestamp|first_country|first_countryID_timestamp|last_country|last_countryID_timestamp|
+-----+-----------+-----------------------+----------+----------------------+-------------+-------------------------+------------+------------------------+
| 1| CA| 2022-05-01 10:10:00| CT| 2022-05-02 23:11:00| USA| 2022-05-02 23:11:00| UK| 2022-05-10 12:12:00|
| 2| TX| 2022-05-09 00:15:00| CA| 2022-05-10 09:09:00| USA| 2022-05-10 09:09:00| USA| 2022-05-10 09:09:00|
| 3| MO| 2022-05-16 21:15:00| AL| 2022-05-17 04:04:00| Japan| 2022-05-16 21:15:00| UK| 2022-05-20 19:07:00|
+-----+-----------+-----------------------+----------+----------------------+-------------+-------------------------+------------+------------------------+
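As a sanity check, the same "first/last non-null value per empId, ordered by timestamp" semantics can be sketched in plain Python without Spark. This is only an illustration of the logic (the row tuples below re-encode the sample table with None for the nulls, and the helper name first_last is made up for this sketch), not the answer's actual code path:

```python
from datetime import datetime

# Sample rows (None marks the nulls in the question's table); timestamps
# use the same MM/dd/yyyy hh:mma format as the question.
rows = [
    (1, "5/1/2022 10:10am", "CA", None),
    (1, None, "CA", "USA"),
    (1, "5/2/2022 11:11pm", "CT", "USA"),
    (1, None, "NJ", "USA"),
    (1, "5/10/2022 12:12pm", None, "UK"),
    (2, None, "VA", "USA"),
    (2, "5/9/2022 12:15am", "TX", None),
    (2, "5/10/2022 09:09am", "CA", "USA"),
    (3, None, "NY", "USA"),
    (3, "5/16/2022 09:15pm", "MO", "Japan"),
    (3, "5/17/2022 04:04am", "AL", "USA"),
    (3, "5/20/2022 07:07pm", None, "UK"),
]

def parse(ts):
    # %I:%M%p parses 12-hour times; strptime accepts lowercase am/pm
    return datetime.strptime(ts, "%m/%d/%Y %I:%M%p") if ts else None

def first_last(rows, col):
    """First/last non-null value of column index `col` (2=stateID,
    3=countryID) per empId, keeping only rows that have a timestamp,
    as (timestamp, value) pairs."""
    out = {}
    for emp_id, ts, state, country in rows:
        t = parse(ts)
        v = (state, country)[col - 2]
        if t is None or v is None:
            continue  # mirrors dropna on timestamp / ignorenulls on value
        rec = out.setdefault(emp_id, {"first": (t, v), "last": (t, v)})
        if t < rec["first"][0]:
            rec["first"] = (t, v)
        if t > rec["last"][0]:
            rec["last"] = (t, v)
    return out

states = first_last(rows, 2)
countries = first_last(rows, 3)
# states[1] -> first CA at 5/1 10:10, last CT at 5/2 23:11
# countries[3] -> first Japan at 5/16 21:15, last UK at 5/20 19:07
```

The results match the te3 output above, which is a quick way to convince yourself the window/groupBy logic does what you want before running it on the millions-of-rows table.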