Pyspark: Joining 2 dataframes by ID & Closest date backwards
I'm having a world of issues performing a rolling join of two dataframes in PySpark (and Python in general). I am looking to join two PySpark dataframes by their ID and the closest date backwards (meaning the date in the second dataframe cannot be greater than the one in the first).
Table_1:

+---+---------+-----+
|ID |Date     |Value|
+---+---------+-----+
|A1 |1/15/2020|5    |
|A2 |1/20/2020|10   |
|A3 |2/21/2020|12   |
|A1 |1/21/2020|6    |
+---+---------+-----+

Table_2:

+---+---------+-------+
|ID |Date_2   |Value_2|
+---+---------+-------+
|A1 |1/10/2020|1      |
|A1 |1/12/2020|5      |
|A1 |1/16/2020|3      |
|A2 |1/25/2020|20     |
|A2 |1/1/2020 |12     |
|A3 |1/31/2020|14     |
|A3 |1/30/2020|12     |
+---+---------+-------+

Desired Result:

+---+---------+-----+-------+
|ID |Date     |Value|Value_2|
+---+---------+-----+-------+
|A1 |1/15/2020|5    |5      |
|A2 |1/20/2020|10   |12     |
|A3 |2/21/2020|12   |14     |
|A1 |1/21/2020|6    |3      |
+---+---------+-----+-------+
In essence, I understand an SQL query could do the trick, since I can run spark.sql("query"), but I'm open to anything else. I've tried several things which aren't working in a Spark context. Thanks!
I would prefer to solve this problem using Window.
You need to join both datasets on id and date (>=); then compute the difference in days between the two dates, so you can use dense_rank to filter down to just the closest date.
from pyspark.sql.functions import col, datediff, dense_rank
from pyspark.sql.window import Window
from datetime import date

df1 = spark.createDataFrame(
    [
        ("A1", date(2020, 1, 15), 5),
        ("A2", date(2020, 1, 20), 10),
        ("A3", date(2020, 2, 21), 12),
        ("A1", date(2020, 1, 21), 6),
    ],
    ["id_1", "date_1", "value_1"],
)

df2 = spark.createDataFrame(
    [
        ("A1", date(2020, 1, 10), 1),
        ("A1", date(2020, 1, 12), 5),
        ("A1", date(2020, 1, 16), 3),
        ("A2", date(2020, 1, 25), 20),
        ("A2", date(2020, 1, 1), 12),
        ("A3", date(2020, 1, 31), 14),
        ("A3", date(2020, 1, 30), 12),
    ],
    ["id_2", "date_2", "value_2"],
)

# Rank the candidate matches per left-hand row; partitioning by the
# row's key (id_1, date_1) is safer than partitioning by value_1,
# which would mix rows that happen to share the same value.
winSpec = Window.partitionBy("id_1", "date_1").orderBy("date_difference")

df3 = (
    df1
    .join(df2, [df1.id_1 == df2.id_2, df1.date_1 >= df2.date_2])
    .withColumn("date_difference", datediff("date_1", "date_2"))
    .withColumn("dr", dense_rank().over(winSpec))
    .where("dr = 1")
    .select(
        col("id_1").alias("id"),
        col("date_1").alias("date"),
        col("value_1"),
        col("value_2"),
    )
)
df3.show(truncate=False)
+---+----------+-------+-------+
|id |date |value_1|value_2|
+---+----------+-------+-------+
|A1 |2020-01-21|6 |3 |
|A1 |2020-01-15|5 |5 |
|A2 |2020-01-20|10 |12 |
|A3 |2020-02-21|12 |14 |
+---+----------+-------+-------+
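Since the question mentions spark.sql, the same closest-backwards join can be sketched in plain Spark SQL. This is an untested sketch assuming df1 and df2 have been registered as temp views with the column names above:

```sql
-- Rank each candidate date_2 per (id_1, date_1) row and keep the closest one
SELECT id, date, value_1, value_2
FROM (
  SELECT df1.id_1  AS id,
         df1.date_1 AS date,
         df1.value_1,
         df2.value_2,
         DENSE_RANK() OVER (
           PARTITION BY df1.id_1, df1.date_1
           ORDER BY DATEDIFF(df1.date_1, df2.date_2)
         ) AS dr
  FROM df1
  JOIN df2
    ON df1.id_1 = df2.id_2
   AND df1.date_1 >= df2.date_2
) t
WHERE dr = 1
```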
Here is my attempt. First, I determine the Date_2 that meets your condition. After that, I join the second dataframe again to get the Value_2.
from pyspark.sql.functions import monotonically_increasing_id, unix_timestamp, max

# newId preserves the original row order of df1 through the join.
# Note: max('Date_2') compares the date strings lexicographically,
# which works here because all dates share the same M/dd/yy layout;
# parsing to a date type first would be more robust.
df3 = df1.withColumn('newId', monotonically_increasing_id()) \
    .join(df2, 'ID', 'left') \
    .where(unix_timestamp('Date', 'M/dd/yy') >= unix_timestamp('Date_2', 'M/dd/yy')) \
    .groupBy(*df1.columns, 'newId') \
    .agg(max('Date_2').alias('Date_2'))

df3.orderBy('newId').show(20, False)
+---+-------+-----+-----+-------+
|ID |Date |Value|newId|Date_2 |
+---+-------+-----+-----+-------+
|A1 |1/15/20|5 |0 |1/12/20|
|A2 |1/20/20|10 |1 |1/11/20|
|A3 |2/21/20|12 |2 |1/31/20|
|A1 |1/21/20|6 |3 |1/16/20|
+---+-------+-----+-----+-------+
df3.join(df2, ['ID', 'Date_2'], 'left') \
.orderBy('newId') \
.drop('Date_2', 'newId') \
.show(20, False)
+---+-------+-----+-------+
|ID |Date |Value|Value_2|
+---+-------+-----+-------+
|A1 |1/15/20|5 |5 |
|A2 |1/20/20|10 |12 |
|A3 |2/21/20|12 |14 |
|A1 |1/21/20|6 |3 |
+---+-------+-----+-------+
from pyspark.sql.functions import datediff, to_date, min

df1 = spark.createDataFrame([('A1', '1/15/2020', 5),
                             ('A2', '1/20/2020', 10),
                             ('A3', '2/21/2020', 12),
                             ('A1', '1/21/2020', 6)],
                            ['ID1', 'Date1', 'Value1'])
df2 = spark.createDataFrame([('A1', '1/10/2020', 1),
                             ('A1', '1/12/2020', 5),
                             ('A1', '1/16/2020', 3),
                             ('A2', '1/25/2020', 20),
                             ('A2', '1/1/2020', 12),
                             ('A3', '1/31/2020', 14),
                             ('A3', '1/30/2020', 12)],
                            ['ID2', 'Date2', 'Value2'])

# distance >= 0 keeps rows where Date2 is on or before Date1
# (the pattern 'M/d/yyyy' accepts single-digit months and days)
df2 = df1.join(df2, df1.ID1 == df2.ID2) \
    .withColumn("distance", datediff(to_date(df1.Date1, 'M/d/yyyy'),
                                     to_date(df2.Date2, 'M/d/yyyy'))) \
    .filter("distance >= 0")

# Keep, per (ID1, Date1), the row with the smallest distance,
# then join back to recover its Value2
df2.groupBy(df2.ID1, df2.Date1, df2.Value1) \
    .agg(min(df2.distance).alias('distance')) \
    .join(df2, ['ID1', 'Date1', 'distance']) \
    .select(df2.ID1, df2.Date1, df2.Value1, df2.Value2) \
    .orderBy('ID1', 'Date1').show()
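As a cross-check outside Spark, this closest-date-backwards join is exactly what pandas calls an as-of merge. A minimal sketch with the same sample data (pd.merge_asof requires both frames to be sorted on the join date):

```python
import pandas as pd

df1 = pd.DataFrame({
    "ID": ["A1", "A2", "A3", "A1"],
    "Date": pd.to_datetime(["1/15/2020", "1/20/2020", "2/21/2020", "1/21/2020"]),
    "Value": [5, 10, 12, 6],
})
df2 = pd.DataFrame({
    "ID": ["A1", "A1", "A1", "A2", "A2", "A3", "A3"],
    "Date": pd.to_datetime(["1/10/2020", "1/12/2020", "1/16/2020",
                            "1/25/2020", "1/1/2020", "1/31/2020", "1/30/2020"]),
    "Value_2": [1, 5, 3, 20, 12, 14, 12],
})

# direction="backward" picks the closest Date in df2 that is <= the Date
# in df1, matching within each ID; both frames must be sorted on "Date"
result = pd.merge_asof(
    df1.sort_values("Date"),
    df2.sort_values("Date"),
    on="Date", by="ID", direction="backward",
)
print(result)  # Value_2 per row (sorted by Date): 5, 12, 3, 14
```

Unlike the Spark versions above, rows of df1 with no earlier match would get NaN instead of being dropped, which may or may not be what you want.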