Left Join with conditions and aggregate MAX using Spark Python / PySpark
What I have: two massive Spark dataframes; here are some samples.
Dataframe A:

| ID | IG | OpenDate |
| --- | --- | --- |
| P111 | 100 | 13/04/2022 |
| P222 | 101 | 16/04/2022 |
| P333 | 102 | 20/04/2022 |
Dataframe B:

| IG | Service | Dt_Service |
| --- | --- | --- |
| 100 | A | 12/04/2022 |
| 100 | B | 13/04/2022 |
| 100 | B | 14/04/2022 |
| 101 | A | 15/04/2022 |
| 101 | A | 16/04/2022 |
| 101 | B | 17/04/2022 |
| 101 | B | 18/04/2022 |
| 102 | A | 19/04/2022 |
| 102 | B | 20/04/2022 |
What I want: I want to left join the two columns 'Service' and 'Dt_Service' onto dataframe A using the key 'IG', but keeping only the most recent 'Service' with its corresponding date. In other words, for each row of dataframe A I need the latest 'Service' and the date it occurred. This is the result I expect:
| ID | IG | OpenDate | Service | Dt_Service |
| --- | --- | --- | --- | --- |
| P111 | 100 | 13/04/2022 | B | 14/04/2022 |
| P222 | 101 | 16/04/2022 | B | 18/04/2022 |
| P333 | 102 | 20/04/2022 | B | 20/04/2022 |
Tool: Spark 2.2 with PySpark, since I am working on Hadoop.
Thank you for your help.
As samkart said, we can use rank/row_number to get the latest service per IG first, then join to get the desired result:
```python
from pyspark.sql import functions as F
from pyspark.sql import Window

se = "IG string,Service string,Dt_Service string"
de = [("100","A","2022-04-12"),("100","B","2022-04-13"),("100","B","2022-04-14"),("101","A","2022-04-15"),("101","A","2022-04-16"),("101","B","2022-04-17"),("101","B","2022-04-18"),("102","A","2022-04-19"),("102","B","2022-04-20")]
df1 = spark.createDataFrame([("P111","100","13/04/2022"),("P222","101","16/04/2022"),("P333","102","20/04/2022")], "ID string,IG string,OpenDate string")
df2 = spark.createDataFrame(de, se)
# keep only the most recent service row per IG
df2 = df2.withColumn("rn", F.row_number().over(Window.partitionBy("IG").orderBy(F.to_date(F.col("Dt_Service")).desc()))).filter("rn == 1").drop("rn")
df1.join(df2, "IG", "inner").show()
```
```
#output
+---+----+----------+-------+----------+
| IG|  ID|  OpenDate|Service|Dt_Service|
+---+----+----------+-------+----------+
|100|P111|13/04/2022|      B|2022-04-14|
|101|P222|16/04/2022|      B|2022-04-18|
|102|P333|20/04/2022|      B|2022-04-20|
+---+----+----------+-------+----------+
```
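Note that the join above is an inner join. Since the question asks for a left join (so rows of dataframe A survive even when their IG has no match in dataframe B), a minimal sketch of that variant, assuming unmatched rows should simply get null Service/Dt_Service, would be:

```python
# Left join keeps every row of df1; Service and Dt_Service are null when the IG has no match.
# Reuses df1 and the deduplicated df2 from above; only the join type changes.
result = df1.join(df2, "IG", "left").select("ID", "IG", "OpenDate", "Service", "Dt_Service")
result.show()
```

The extra select just reorders the columns to match the expected output.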