Spark Scala: joining with a subquery with limit
I need to join two tables on fake_id, but table 2 contains more than one matching record per fake_id, so I need to match against the record where table2.end_time >= table1.event_time and table2.start_time <= table1.event_time. If more than one record in table 2 matches this condition, I need to consider only the latest one by updated_time.
Here is what I tried.
spark.sql("select t1.fake_id, t1.attribute_1,t1.event_time,t22.end_time from table1 t1 left outer join ( select fake_id, end_time from table2 t2 where t2.fake_id=t1.fake_id and t2.end_time >= t1.event_time and t2.start_time <= t1.event_time order by t2.updated_time desc limit 1) as t22 on t1.fake_id=t22.fake_id")
For the above statement, Spark throws an error for the unknown column t1.fake_id.
Table.1 -
---------------------------------------------------
fake_id   attribute_1   event_time
---------------------------------------------------
1         attr_val_11   2020-08-01 05:00:00
2         attr_val_12   2020-08-01 15:00:00
3         attr_val_31   2020-08-03 07:00:00
4         attr_val_41   2020-08-01 05:00:00
Table.2 -
--------------------------------------------------------------------------
fake_id   start_time            end_time              updated_time
--------------------------------------------------------------------------
1         2020-08-01 02:00:00   2020-08-01 08:00:00   2020-08-01 00:00:00
2         2020-08-01 04:00:00   2020-08-01 23:00:00   2020-08-01 00:00:00
3         2020-08-03 02:00:00   2020-08-03 08:00:00   2020-08-03 08:00:00
3         2020-08-03 05:00:00   2020-08-03 10:00:00   2020-08-03 12:00:00
3         2020-08-04 05:00:00   2020-08-04 10:00:00   2020-08-04 12:00:00
4         2020-08-01 08:00:00   2020-08-01 18:00:00   2020-08-01 18:00:00
4         2020-08-01 02:00:00   2020-08-01 05:00:00   2020-08-01 22:00:00
Result:
-------------------------------------------------------------------------------------------
fake_id   attribute_1   event_time            start_time            end_time
-------------------------------------------------------------------------------------------
1         attr_val_11   2020-08-01 05:00:00   2020-08-01 02:00:00   2020-08-01 08:00:00
2         attr_val_12   2020-08-01 15:00:00   2020-08-01 04:00:00   2020-08-01 23:00:00
3         attr_val_31   2020-08-03 07:00:00   2020-08-03 05:00:00   2020-08-03 10:00:00
4         attr_val_41   2020-08-01 05:00:00   2020-08-01 02:00:00   2020-08-01 05:00:00
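For reference, the sample data above can be registered as the temp views table1 and table2 with a minimal Scala sketch like the one below (assuming a spark-shell style SparkSession named spark; the column and view names come from the question, the timestamp casts are an assumption):

// Minimal setup sketch: build the sample data and register it as temp views
// so the queries in this post can be run as-is.
import spark.implicits._
import org.apache.spark.sql.functions.col

val table1 = Seq(
  (1, "attr_val_11", "2020-08-01 05:00:00"),
  (2, "attr_val_12", "2020-08-01 15:00:00"),
  (3, "attr_val_31", "2020-08-03 07:00:00"),
  (4, "attr_val_41", "2020-08-01 05:00:00")
).toDF("fake_id", "attribute_1", "event_time")
  .withColumn("event_time", col("event_time").cast("timestamp"))

val table2 = Seq(
  (1, "2020-08-01 02:00:00", "2020-08-01 08:00:00", "2020-08-01 00:00:00"),
  (2, "2020-08-01 04:00:00", "2020-08-01 23:00:00", "2020-08-01 00:00:00"),
  (3, "2020-08-03 02:00:00", "2020-08-03 08:00:00", "2020-08-03 08:00:00"),
  (3, "2020-08-03 05:00:00", "2020-08-03 10:00:00", "2020-08-03 12:00:00"),
  (3, "2020-08-04 05:00:00", "2020-08-04 10:00:00", "2020-08-04 12:00:00"),
  (4, "2020-08-01 08:00:00", "2020-08-01 18:00:00", "2020-08-01 18:00:00"),
  (4, "2020-08-01 02:00:00", "2020-08-01 05:00:00", "2020-08-01 22:00:00")
).toDF("fake_id", "start_time", "end_time", "updated_time")
  .withColumn("start_time", col("start_time").cast("timestamp"))
  .withColumn("end_time", col("end_time").cast("timestamp"))
  .withColumn("updated_time", col("updated_time").cast("timestamp"))

table1.createOrReplaceTempView("table1")
table2.createOrReplaceTempView("table2")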
Use between for the time condition and row_number() over a window ordered by updated_time descending, then keep the first row, i.e. the match with the latest updated_time.
spark.sql("""
select
fake_id,
attribute_1,
event_time,
start_time,
end_time
from (
select
t1.fake_id,
t1.attribute_1,
t1.event_time,
t2.start_time,
t2.end_time,
row_number() OVER (PARTITION BY t1.fake_id, t1.attribute_1 ORDER BY t2.updated_time DESC) as rank
from
table1 t1
left join
table2 t2
on
t1.fake_id = t2.fake_id and
t1.event_time between t2.start_time and t2.end_time) t
where
rank = 1
order by
fake_id
""").show()
+-------+-----------+-------------------+-------------------+-------------------+
|fake_id|attribute_1| event_time| start_time| end_time|
+-------+-----------+-------------------+-------------------+-------------------+
| 1|attr_val_11|2020-08-01 05:00:00|2020-08-01 02:00:00|2020-08-01 08:00:00|
| 2|attr_val_12|2020-08-01 15:00:00|2020-08-01 04:00:00|2020-08-01 23:00:00|
| 3|attr_val_31|2020-08-03 07:00:00|2020-08-03 05:00:00|2020-08-03 10:00:00|
| 4|attr_val_41|2020-08-01 05:00:00|2020-08-01 02:00:00|2020-08-01 05:00:00|
+-------+-----------+-------------------+-------------------+-------------------+
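For completeness, here is a Scala DataFrame sketch of the same window-function idea, an equivalent rewrite rather than the code above (it assumes the table1 and table2 DataFrames from the setup sketch earlier in this post):

// Join with the between condition, then keep only the match with the latest
// updated_time for each table1 row.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

val joined = table1.as("t1")
  .join(table2.as("t2"),
    col("t1.fake_id") === col("t2.fake_id") &&
      col("t1.event_time").between(col("t2.start_time"), col("t2.end_time")),
    "left")

val w = Window
  .partitionBy(col("t1.fake_id"), col("t1.attribute_1"))
  .orderBy(col("t2.updated_time").desc)

val result = joined
  .withColumn("rn", row_number().over(w))
  .filter(col("rn") === 1)
  .select(
    col("t1.fake_id"), col("t1.attribute_1"), col("t1.event_time"),
    col("t2.start_time"), col("t2.end_time"))
  .orderBy("fake_id")

result.show()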