INNER join two PySpark dataframes on the closest previous date [next Date]
Here are my two PySpark DataFrames:
a = sc.parallelize([['2017-05-14', 'foo',   24, 'abc'],
                    ['2017-05-16', 'user1', 26, 'mno'],
                    ['2017-05-17', 'user2', 26, 'mno'],
                    ['2017-05-19', 'user2', 27, 'mno'],
                    ['2017-05-19', 'user3', 28, 'mno']]) \
      .toDF(['A_Date', 'user', 'id', 'info'])
b = sc.parallelize([['2017-05-15', 'foo',   24, 'def'],
                    ['2017-05-22', 'user2', 27, 'mno'],
                    ['2017-05-20', 'user3', 28, 'mno']]) \
      .toDF(['B_Date', 'user', 'id', 'info'])
and I want to join the two dataframes so that the A_Date attached to each row of b is the closest date in a that is strictly earlier than B_Date, as shown below:
c = sc.parallelize([['2017-05-15', 'foo',   24, 'def', '2017-05-14'],
                    ['2017-05-22', 'user2', 27, 'mno', '2017-05-19'],
                    ['2017-05-20', 'user3', 28, 'mno', '2017-05-19']]) \
      .toDF(['B_Date', 'user', 'id', 'info', 'A_Date'])
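In other words, the required operation is an "as-of" join: for each row of b, pick the row of a with the same user whose A_Date is the latest date strictly before B_Date. Its semantics can be sketched in plain Python (no Spark needed; ISO dates compare correctly as strings):

```python
# Plain-Python sketch of the desired "as-of" join semantics.
a_rows = [('2017-05-14', 'foo', 24, 'abc'),
          ('2017-05-16', 'user1', 26, 'mno'),
          ('2017-05-17', 'user2', 26, 'mno'),
          ('2017-05-19', 'user2', 27, 'mno'),
          ('2017-05-19', 'user3', 28, 'mno')]
b_rows = [('2017-05-15', 'foo', 24, 'def'),
          ('2017-05-22', 'user2', 27, 'mno'),
          ('2017-05-20', 'user3', 28, 'mno')]

result = []
for b_date, user, id_, info in b_rows:
    # Candidate A_Dates: same user, strictly earlier than B_Date.
    candidates = [ad for ad, au, *_ in a_rows if au == user and ad < b_date]
    if candidates:
        # Keep only the closest (i.e. maximum) earlier date.
        result.append((b_date, user, id_, info, max(candidates)))
```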
You could use the following approach:
import pyspark.sql.functions as F

b.join(a, (a.A_Date < b.B_Date) & (a.user == b.user))\
 .select(b.B_Date, b.user, b.id, b.info, a.A_Date)\
 .groupby('B_Date', 'user', 'id', 'info')\
 .agg(F.max('A_Date').alias('A_Date'))\
 .sort('B_Date')\
 .show()
This results in the required output:
+----------+-----+---+----+----------+
| B_Date| user| id|info| A_Date|
+----------+-----+---+----+----------+
|2017-05-15| foo| 24| def|2017-05-14|
|2017-05-20|user3| 28| mno|2017-05-19|
|2017-05-22|user2| 27| mno|2017-05-19|
+----------+-----+---+----+----------+
This could be relatively slow because of the cross join.
Alternatively you can use a window function:
from pyspark.sql import Window

# Attach to each a-row the user's next A_Date; a b-row then matches the
# single a-row whose interval [A_Date, next_A_Date) contains B_Date.
windowSpec = Window.partitionBy('user').orderBy('A_Date')
a_lagged = a.withColumn('next_A_Date', F.lead(a['A_Date']).over(windowSpec))
b.join(a_lagged, ((a_lagged.A_Date < b.B_Date)
                  & ((a_lagged.next_A_Date >= b.B_Date) | a_lagged.next_A_Date.isNull())
                  & (a_lagged.user == b.user)))\
 .select(b.B_Date, b.user, b.id, b.info, a_lagged.A_Date)\
 .sort('B_Date')\
 .show()
This also gives:
+----------+-----+---+----+----------+
| B_Date| user| id|info| A_Date|
+----------+-----+---+----+----------+
|2017-05-15| foo| 24| def|2017-05-14|
|2017-05-20|user3| 28| mno|2017-05-19|
|2017-05-22|user2| 27| mno|2017-05-19|
+----------+-----+---+----+----------+
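One way to see why the window trick picks only the closest prior date is to mimic it in plain Python: compute each row's next A_Date per user (what a lead window column gives you), then keep the a-row whose interval covers B_Date. A minimal sketch on the users and dates from the example:

```python
from collections import defaultdict

a_rows = [('2017-05-14', 'foo'), ('2017-05-16', 'user1'),
          ('2017-05-17', 'user2'), ('2017-05-19', 'user2'),
          ('2017-05-19', 'user3')]
b_rows = [('2017-05-15', 'foo'), ('2017-05-22', 'user2'),
          ('2017-05-20', 'user3')]

# Emulate a lead() over a window partitioned by user, ordered by A_Date:
# pair each A_Date with the user's next A_Date (None for the last row).
by_user = defaultdict(list)
for a_date, user in sorted(a_rows):
    by_user[user].append(a_date)

a_with_next = []
for user, dates in by_user.items():
    for d, nxt in zip(dates, dates[1:] + [None]):
        a_with_next.append((user, d, nxt))

# Join condition: A_Date < B_Date and (next_A_Date >= B_Date or it is null).
matched = sorted(
    (b_date, user, a_date)
    for b_date, b_user in b_rows
    for user, a_date, nxt in a_with_next
    if user == b_user and a_date < b_date and (nxt is None or nxt >= b_date)
)
```

Because at most one interval per user can contain a given B_Date, no groupby/max pass is needed afterwards.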
If you look at the source code of join:
def join(self, other, on=None, how=None):
    """Joins with another :class:`DataFrame`, using the given join expression.

    :param other: Right side of the join
    :param on: a string for the join column name, a list of column names,
        a join expression (Column), or a list of Columns.
        If `on` is a string or a list of strings indicating the name of the join column(s),
        the column(s) must exist on both sides, and this performs an equi-join.
    :param how: str, default 'inner'.
        One of `inner`, `outer`, `left_outer`, `right_outer`, `leftsemi`.
    """
It's clear the on parameter can be any condition. So you can do the following to check the dates while joining:
b.join(a, [a.user == b.user, a.id == b.id, a.A_Date < b.B_Date]).select(b.B_Date, b.user, b.id, a.A_Date)
You should have your desired output dataframe.
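A join with an arbitrary on condition is, semantically, just a filter over all row pairs. A plain-Python sketch of that condition applied to the sample data (note that, in general, several earlier A_Dates per user can satisfy A_Date < B_Date, which is why the first answer adds a groupby/max step; here the extra id equality happens to leave exactly one match per row of b):

```python
a_rows = [('2017-05-14', 'foo', 24), ('2017-05-16', 'user1', 26),
          ('2017-05-17', 'user2', 26), ('2017-05-19', 'user2', 27),
          ('2017-05-19', 'user3', 28)]
b_rows = [('2017-05-15', 'foo', 24), ('2017-05-22', 'user2', 27),
          ('2017-05-20', 'user3', 28)]

# Keep every (b, a) pair where user and id match and A_Date < B_Date.
pairs = [(b, a) for b in b_rows for a in a_rows
         if a[1] == b[1] and a[2] == b[2] and a[0] < b[0]]
```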