Pyspark: Joining 2 dataframes by ID & closest date backwards

I'm having a world of issues performing a rolling join of two dataframes in PySpark (and Python in general). I am looking to join two PySpark dataframes by their ID and the closest date backwards (meaning the date in the second dataframe cannot be greater than the one in the first).

Table_1:

+-----+------------+-------+
| ID  | Date       | Value |
+-----+------------+-------+
| A1  | 01-15-2020 |     5 |
| A2  | 01-20-2020 |    10 |
| A3  | 02-21-2020 |    12 |
| A1  | 01-21-2020 |     6 |
+-----+------------+-------+

Table_2:

+-----+------------+---------+
| ID  | Date       | Value_2 |
+-----+------------+---------+
| A1  | 01-10-2020 |       1 |
| A1  | 01-12-2020 |       5 |
| A1  | 01-16-2020 |       3 |
| A2  | 01-25-2020 |      20 |
| A2  | 01-01-2020 |      12 |
| A3  | 01-31-2020 |      14 |
| A3  | 01-30-2020 |      12 |
+-----+------------+---------+

Desired Result:

+-----+------------+-------+---------+
| ID  | Date       | Value | Value_2 |
+-----+------------+-------+---------+
| A1  | 01-15-2020 |     5 |       5 |
| A2  | 01-20-2020 |    10 |      12 |
| A3  | 02-21-2020 |    12 |      14 |
| A1  | 01-21-2020 |     6 |       3 |
+-----+------------+-------+---------+

In essence, I understand an SQL query could do the trick, since I can run it with spark.sql("query"), so that or anything else would work. I've tried several things that aren't working in a Spark context. Thanks!
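
For reference, a minimal sketch of the spark.sql route mentioned above: the view names table_1/table_2, the DATE-typed columns, and the column names are assumptions based on the tables shown. It joins on ID with t2.Date <= t1.Date and keeps, for each table_1 row, only the closest earlier-or-equal table_2 row via ROW_NUMBER.

# Sketch only: assumes table_1 and table_2 are DataFrames with DATE-typed
# columns named as in the tables above.
table_1.createOrReplaceTempView("table_1")
table_2.createOrReplaceTempView("table_2")

result = spark.sql("""
    SELECT ID, Date, Value, Value_2
    FROM (
        SELECT t1.ID, t1.Date, t1.Value, t2.Value_2,
               ROW_NUMBER() OVER (
                   PARTITION BY t1.ID, t1.Date   -- one winner per table_1 row
                   ORDER BY t2.Date DESC         -- closest earlier-or-equal date first
               ) AS rn
        FROM table_1 t1
        JOIN table_2 t2
          ON t1.ID = t2.ID
         AND t2.Date <= t1.Date                  -- only dates going backwards
    ) ranked
    WHERE rn = 1
""")
result.show()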

I would prefer to solve this problem using Window.
You need to join both datasets on id and date (>=), then compute how many days apart the two dates are, so you can use dense_rank to keep only the closest date.

from pyspark.sql.functions import col, datediff, dense_rank
from pyspark.sql.window import Window
from datetime import date

df1 = (
  spark
  .createDataFrame(
    [
      ("A1",date(2020, 1, 15), 5),
      ("A2",date(2020, 1, 20), 10),
      ("A3",date(2020, 2, 21), 12),
      ("A1",date(2020, 1, 21), 6),
    ],
    ["id_1","date_1","value_1"]
  )
)

df2 = (
  spark
  .createDataFrame(
    [
      ("A1",date(2020, 1, 10), 1),
      ("A1",date(2020, 1, 12), 5),
      ("A1",date(2020, 1, 16), 3),
      ("A2",date(2020, 1, 25), 20),
      ("A2",date(2020, 1, 1), 12),
      ("A3",date(2020, 1, 31), 14),
      ("A3",date(2020, 1, 30), 12)
    ],
    ["id_2","date_2","value_2"]
  )
)

# rank candidate df2 rows by how close date_2 is to date_1
# (partitioning by value_1 works here because value_1 is unique per df1 row)
winSpec = Window.partitionBy("value_1").orderBy("date_difference")

df3 = (
  df1
  # keep only df2 rows with the same id and a date on or before date_1
  .join(df2, [df1.id_1==df2.id_2, df1.date_1>=df2.date_2])
  .withColumn("date_difference", datediff("date_1","date_2"))
  # dr=1 marks the df2 row closest (backwards) to each df1 row
  .withColumn("dr", dense_rank().over(winSpec))
  .where("dr=1")
  .select(
    col("id_1").alias("id"),
    col("date_1").alias("date"),
    col("value_1"),
    col("value_2")
  )
)

df3.show(truncate=False)

+---+----------+-------+-------+
|id |date      |value_1|value_2|
+---+----------+-------+-------+
|A1 |2020-01-21|6      |3      |
|A1 |2020-01-15|5      |5      |
|A2 |2020-01-20|10     |12     |
|A3 |2020-02-21|12     |14     |
+---+----------+-------+-------+
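
One caveat on the design choice above: the window partitions by value_1, which only identifies a df1 row because value_1 happens to be unique in this sample. If value_1 can repeat, partitioning by the df1 key columns is the safer variant; a minimal sketch, assuming (id_1, date_1) uniquely identifies a df1 row:

# rank candidates per df1 row instead of per value_1
winSpec = Window.partitionBy("id_1", "date_1").orderBy("date_difference")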

Here is my attempt.

First, I determine the Date_2 which meets your condition. After that, I join the second dataframe again to get the Value_2.

from pyspark.sql.functions import monotonically_increasing_id, unix_timestamp, max

# Tag each df1 row with an id, join on ID, keep only Date_2 values not after Date,
# and take the latest such Date_2 per original row.
df3 = df1.withColumn('newId', monotonically_increasing_id()) \
  .join(df2, 'ID', 'left') \
  .where(unix_timestamp('Date', 'M/dd/yy') >= unix_timestamp('Date_2', 'M/dd/yy')) \
  .groupBy(*df1.columns, 'newId') \
  .agg(max('Date_2').alias('Date_2'))
df3.orderBy('newId').show(20, False)

+---+-------+-----+-----+-------+
|ID |Date   |Value|newId|Date_2 |
+---+-------+-----+-----+-------+
|A1 |1/15/20|5    |0    |1/12/20|
|A2 |1/20/20|10   |1    |1/11/20|
|A3 |2/21/20|12   |2    |1/31/20|
|A1 |1/21/20|6    |3    |1/16/20|
+---+-------+-----+-----+-------+

# Join back on (ID, Date_2) to pull in the matching Value_2, then drop the helper columns.
df3.join(df2, ['ID', 'Date_2'], 'left') \
  .orderBy('newId') \
  .drop('Date_2', 'newId') \
  .show(20, False)

+---+-------+-----+-------+
|ID |Date   |Value|Value_2|
+---+-------+-----+-------+
|A1 |1/15/20|5    |5      |
|A2 |1/20/20|10   |12     |
|A3 |2/21/20|12   |14     |
|A1 |1/21/20|6    |3      |
+---+-------+-----+-------+

from pyspark.sql.functions import datediff, to_date, min

df1=spark.createDataFrame([('A1','1/15/2020',5),
                           ('A2','1/20/2020',10),
                           ('A3','2/21/2020',12),
                           ('A1','1/21/2020',6)],
                           ['ID1','Date1','Value1'])

df2=spark.createDataFrame([('A1','1/10/2020',1),
                           ('A1','1/12/2020',5),
                           ('A1','1/16/2020',3),
                           ('A2','1/25/2020',20),
                           ('A2','1/1/2020',12),
                           ('A3','1/31/2020',14),
                           ('A3','1/30/2020',12)],['ID2','Date2','Value2'])

# Join on ID, compute the day gap, and keep only df2 dates strictly before Date1
df2=df1.join(df2,df1.ID1==df2.ID2) \
    .withColumn("distance",datediff(to_date(df1.Date1,'MM/dd/yyyy'),\
     to_date(df2.Date2,'MM/dd/yyyy'))).filter("distance>0")

# For each df1 row, keep the smallest gap and join back to pull in its Value2
df2.groupBy(df2.ID1,df2.Date1,df2.Value1)\
   .agg(min(df2.distance).alias('distance')).join(df2, ['ID1','Date1','distance'])\
   .select(df2.ID1,df2.Date1,df2.Value1,df2.Value2).orderBy('ID1','Date1').show()
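
Note that filter("distance>0") keeps only df2 dates strictly before Date1, while the question's condition ("cannot be greater") also allows equal dates. A minimal sketch of that variant, reusing the same df1/df2 as above (the name joined is just illustrative):

# allow same-day matches as well (distance == 0)
joined = df1.join(df2, df1.ID1==df2.ID2) \
    .withColumn("distance", datediff(to_date(df1.Date1,'MM/dd/yyyy'),
                                     to_date(df2.Date2,'MM/dd/yyyy'))) \
    .filter("distance>=0")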
