Left Join with conditions and aggregate MAX using Spark Python / PySpark

What I have: 2 massive Spark dataframes, but here are some samples:

  • Dataframe A:

ID    IG   OpenDate
P111  100  13/04/2022
P222  101  16/04/2022
P333  102  20/04/2022
  • Dataframe B:

IG   Service  Dt_Service
100  A        12/04/2022
100  B        13/04/2022
100  B        14/04/2022
101  A        15/04/2022
101  A        16/04/2022
101  B        17/04/2022
101  B        18/04/2022
102  A        19/04/2022
102  B        20/04/2022

What I want: I want to left join the two columns 'Service' and 'Dt_Service' onto Dataframe A using the key 'IG', but keep only the most recent 'Service' with its corresponding date. So for each row in Dataframe A, I need the latest 'Service' and the date it occurred. This is the result I expect:

ID    IG   OpenDate    Service  Dt_Service
P111  100  13/04/2022  B        14/04/2022
P222  101  16/04/2022  B        18/04/2022
P333  102  20/04/2022  B        20/04/2022

Tool: Spark 2.2 with PySpark, since I am working on Hadoop.

Thank you for your help.

As samkart said, we can use rank/row_number to get the latest service per IG first, then join to get the desired result:

from pyspark.sql import functions as F
from pyspark.sql import Window

# sample data for Dataframe B (dates already in yyyy-MM-dd)
se="IG string,Service string,Dt_Service string"
de=[("100","A","2022-04-12"),("100","B","2022-04-13"),("100","B","2022-04-14"),("101","A","2022-04-15"),("101","A","2022-04-16"),("101","B","2022-04-17"),("101","B","2022-04-18"),("102","A","2022-04-19"),("102","B","2022-04-20")]
fd=spark.createDataFrame(de,se)

# sample data for Dataframe A
df1=spark.createDataFrame([("P111","100","13/04/2022"),("P222","101","16/04/2022"),("P333","102","20/04/2022")],"ID string,IG string,OpenDate string")

# keep only the most recent service row per IG, then join back to Dataframe A
df2=fd.withColumn("rn",F.row_number().over(Window.partitionBy("IG").orderBy(F.to_date(F.col("Dt_Service")).desc()))).filter("rn==1").drop("rn")
df1.join(df2,"IG","inner").show()

#output
+---+----+----------+-------+----------+
| IG|  ID|  OpenDate|Service|Dt_Service|
+---+----+----------+-------+----------+
|100|P111|13/04/2022|      B|2022-04-14|
|101|P222|16/04/2022|      B|2022-04-18|
|102|P333|20/04/2022|      B|2022-04-20|
+---+----+----------+-------+----------+
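
The question asks for a left join; the inner join above returns the same rows here because every IG in Dataframe A has at least one service in Dataframe B. If IDs without any service should be kept (with null Service/Dt_Service), only the last line changes:

df1.join(df2,"IG","left").show()

As a sketch of an alternative that avoids a window function, the most recent row per IG can also be taken with max over a struct, using the same fd and df1 as above (the names latest, d and m are just placeholders here):

# one row per IG: max of (parsed date, Service, raw date); the date field is compared first
latest=(fd.groupBy("IG")
          .agg(F.max(F.struct(F.to_date("Dt_Service").alias("d"),F.col("Service"),F.col("Dt_Service"))).alias("m"))
          .select("IG",F.col("m.Service").alias("Service"),F.col("m.Dt_Service").alias("Dt_Service")))
df1.join(latest,"IG","left").show()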
