Left Join with conditions and aggregate MAX using Spark Python / PySpark
What I have: two massive Spark dataframes; here are some samples.
Dataframe A:

| ID | IG | OpenDate |
| --- | --- | --- |
| P111 | 100 | 13/04/2022 |
| P222 | 101 | 16/04/2022 |
| P333 | 102 | 20/04/2022 |
Dataframe B:

| IG | Service | Dt_Service |
| --- | --- | --- |
| 100 | A | 12/04/2022 |
| 100 | B | 13/04/2022 |
| 100 | B | 14/04/2022 |
| 101 | A | 15/04/2022 |
| 101 | A | 16/04/2022 |
| 101 | B | 17/04/2022 |
| 101 | B | 18/04/2022 |
| 102 | A | 19/04/2022 |
| 102 | B | 20/04/2022 |
What I want: I want to left join the two columns 'Service' and 'Dt_Service' onto dataframe A using the key 'IG', but keeping only the most recent 'Service' with its corresponding date. In other words, for each row of dataframe A I need the latest 'Service' and the date it occurred. This is the result I expect:
| ID | IG | OpenDate | Service | Dt_Service |
| --- | --- | --- | --- | --- |
| P111 | 100 | 13/04/2022 | B | 14/04/2022 |
| P222 | 101 | 16/04/2022 | B | 18/04/2022 |
| P333 | 102 | 20/04/2022 | B | 20/04/2022 |
Tool: Spark 2.2 with PySpark, since I am working on Hadoop.
Thank you for your help.
As samkart said, we can use rank/row_number to get the latest service per IG first, then join to get the desired result:
```python
from pyspark.sql import functions as F
from pyspark.sql import Window

se = "IG string,Service string,Dt_Service string"
de = [("100","A","2022-04-12"),("100","B","2022-04-13"),("100","B","2022-04-14"),("101","A","2022-04-15"),("101","A","2022-04-16"),("101","B","2022-04-17"),("101","B","2022-04-18"),("102","A","2022-04-19"),("102","B","2022-04-20")]
df1 = spark.createDataFrame([("P111","100","13/04/2022"),("P222","101","16/04/2022"),("P333","102","20/04/2022")], "ID string,IG string,OpenDate string")
df2 = spark.createDataFrame(de, se)
# keep only the most recent service row per IG
df2 = df2.withColumn("rn", F.row_number().over(Window.partitionBy("IG").orderBy(F.to_date(F.col("Dt_Service")).desc()))).filter("rn == 1").drop("rn")
df1.join(df2, "IG", "inner").show()
```
```
#output
+---+----+----------+-------+----------+
| IG|  ID|  OpenDate|Service|Dt_Service|
+---+----+----------+-------+----------+
|100|P111|13/04/2022|      B|2022-04-14|
|101|P222|16/04/2022|      B|2022-04-18|
|102|P333|20/04/2022|      B|2022-04-20|
+---+----+----------+-------+----------+
```
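Note that the join above is an inner join. Since the question asks for a left join (so rows of dataframe A survive even when their IG has no match in dataframe B), a minimal sketch of that variant, assuming unmatched rows should simply get null Service/Dt_Service, would be:

```python
# Left join keeps every row of df1; Service and Dt_Service are null when the IG has no match.
# Reuses df1 and the deduplicated df2 from above; only the join type changes.
result = df1.join(df2, "IG", "left").select("ID", "IG", "OpenDate", "Service", "Dt_Service")
result.show()
```

The extra select just reorders the columns to match the expected output.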