
Spark Scala: joining with a subquery with limit

I need to join two tables on fake_id, but table2 can contain more than one matching record per fake_id. I need to match only the records where table2.end_time >= table1.event_time and table2.start_time <= table1.event_time. If more than one record in table2 satisfies this condition, I only want the latest one by updated_time.

Here is what I tried:

 spark.sql("select t1.fake_id, t1.attribute_1,t1.event_time,t22.end_time from table1 t1 left outer join ( select fake_id, end_time from table2 t2 where t2.fake_id=t1.fake_id and t2.end_time >= t1.event_time and t2.start_time <= t1.event_time order by t2.updated_time desc limit 1) as t22 on t1.fake_id=t22.fake_id")

For the above statement, Spark throws an error about an unknown column t1.fake_id.

    Table.1 -
    ---------------------------------------------------------------------------
    fake_id     attribute_1     event_time
    ---------------------------------------------------------------------------
    1           attr_val_11     2020-08-01 05:00:00
    2           attr_val_12     2020-08-01 15:00:00 
    3           attr_val_31     2020-08-03 07:00:00
    4           attr_val_41     2020-08-01 05:00:00
    
    Table.2 -
    ---------------------------------------------------------------------------
    fake_id     start_time              end_time                updated_time
    ---------------------------------------------------------------------------
    1           2020-08-01 02:00:00     2020-08-01 08:00:00     2020-08-01 00:00:00
    2           2020-08-01 04:00:00     2020-08-01 23:00:00     2020-08-01 00:00:00
    3           2020-08-03 02:00:00     2020-08-03 08:00:00     2020-08-03 08:00:00
    3           2020-08-03 05:00:00     2020-08-03 10:00:00     2020-08-03 12:00:00
    3           2020-08-04 05:00:00     2020-08-04 10:00:00     2020-08-04 12:00:00
    4           2020-08-01 08:00:00     2020-08-01 18:00:00     2020-08-01 18:00:00
    4           2020-08-01 02:00:00     2020-08-01 05:00:00     2020-08-01 22:00:00



Result:

    ----------------------------------------------------------------------------------------------
    fake_id     attribute_1     event_time              start_time              end_time
    ----------------------------------------------------------------------------------------------
    1           attr_val_11     2020-08-01 05:00:00     2020-08-01 02:00:00     2020-08-01 08:00:00
    2           attr_val_12     2020-08-01 15:00:00     2020-08-01 04:00:00     2020-08-01 23:00:00
    3           attr_val_31     2020-08-03 07:00:00     2020-08-03 05:00:00     2020-08-03 10:00:00
    4           attr_val_41     2020-08-01 05:00:00     2020-08-01 02:00:00     2020-08-01 05:00:00
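
To try queries against this data locally, the sample rows above can be registered as temporary views. This is only a minimal sketch: the object name SampleTables, the helper register, and the local[*] master are assumptions for illustration and are not part of the original question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, to_timestamp}

// Hypothetical helper: loads the question's sample rows into temp views.
object SampleTables {

  // Registers the sample rows as temp views named "table1" and "table2".
  def register(spark: SparkSession): Unit = {
    import spark.implicits._

    val raw1 = Seq(
      (1, "attr_val_11", "2020-08-01 05:00:00"),
      (2, "attr_val_12", "2020-08-01 15:00:00"),
      (3, "attr_val_31", "2020-08-03 07:00:00"),
      (4, "attr_val_41", "2020-08-01 05:00:00")
    ).toDF("fake_id", "attribute_1", "event_time")

    val raw2 = Seq(
      (1, "2020-08-01 02:00:00", "2020-08-01 08:00:00", "2020-08-01 00:00:00"),
      (2, "2020-08-01 04:00:00", "2020-08-01 23:00:00", "2020-08-01 00:00:00"),
      (3, "2020-08-03 02:00:00", "2020-08-03 08:00:00", "2020-08-03 08:00:00"),
      (3, "2020-08-03 05:00:00", "2020-08-03 10:00:00", "2020-08-03 12:00:00"),
      (3, "2020-08-04 05:00:00", "2020-08-04 10:00:00", "2020-08-04 12:00:00"),
      (4, "2020-08-01 08:00:00", "2020-08-01 18:00:00", "2020-08-01 18:00:00"),
      (4, "2020-08-01 02:00:00", "2020-08-01 05:00:00", "2020-08-01 22:00:00")
    ).toDF("fake_id", "start_time", "end_time", "updated_time")

    // Convert the string columns to proper timestamps.
    val table1 = raw1.withColumn("event_time", to_timestamp(col("event_time")))
    val table2 = Seq("start_time", "end_time", "updated_time")
      .foldLeft(raw2)((df, c) => df.withColumn(c, to_timestamp(col(c))))

    table1.createOrReplaceTempView("table1")
    table2.createOrReplaceTempView("table2")
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough for this small example.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("sample-tables")
      .getOrCreate()

    register(spark)
    spark.table("table1").show(false)
    spark.table("table2").show(false)
    spark.stop()
  }
}
```

In spark-shell, where a session named spark already exists, calling SampleTables.register(spark) should be enough before running spark.sql(...) against table1 and table2.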

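The subquery in the FROM clause cannot reference columns of t1, which is why Spark reports t1.fake_id as unknown. Instead, put the range condition into the join itself using between, number the matching rows with row_number ordered by updated_time descending, and keep only the first row for each event.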

spark.sql("""
    select
        fake_id,
        attribute_1,
        event_time,
        start_time,
        end_time
    from (
        select
            t1.fake_id,
            t1.attribute_1,
            t1.event_time,
            t2.start_time,
            t2.end_time,
            -- number the matching table2 rows per table1 row, newest update first
            row_number() over (
                partition by t1.fake_id, t1.attribute_1, t1.event_time
                order by t2.updated_time desc
            ) as rn
        from
            table1 t1
        left join
            table2 t2
        on
            t1.fake_id = t2.fake_id and
            -- inclusive range: start_time <= event_time <= end_time
            t1.event_time between t2.start_time and t2.end_time
    ) t
    where
        rn = 1  -- keep only the most recently updated match
    order by
        fake_id
""").show()


+-------+-----------+-------------------+-------------------+-------------------+
|fake_id|attribute_1|         event_time|         start_time|           end_time|
+-------+-----------+-------------------+-------------------+-------------------+
|      1|attr_val_11|2020-08-01 05:00:00|2020-08-01 02:00:00|2020-08-01 08:00:00|
|      2|attr_val_12|2020-08-01 15:00:00|2020-08-01 04:00:00|2020-08-01 23:00:00|
|      3|attr_val_31|2020-08-03 07:00:00|2020-08-03 05:00:00|2020-08-03 10:00:00|
|      4|attr_val_41|2020-08-01 05:00:00|2020-08-01 02:00:00|2020-08-01 05:00:00|
+-------+-----------+-------------------+-------------------+-------------------+
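
The same idea can also be written with the DataFrame API instead of a SQL string. The following is a hedged sketch, not the answer author's code: the object name LatestMatchingInterval and the function latestMatch are hypothetical, and it assumes both inputs carry the column names shown in the question. Unlike a window partitioned only by fake_id and attribute_1, this one also partitions by event_time, so several events for the same fake_id are each matched separately.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Hypothetical helper illustrating the join + row_number approach.
object LatestMatchingInterval {

  // For each table1 row, keep the matching table2 row with the latest
  // updated_time; table1 rows without any match are kept with nulls.
  def latestMatch(table1: DataFrame, table2: DataFrame): DataFrame = {
    val t1 = table1.as("t1")
    val t2 = table2.as("t2")

    // Same condition as the SQL version: equal ids and an inclusive time range.
    val joinCond =
      col("t1.fake_id") === col("t2.fake_id") &&
        col("t1.event_time").between(col("t2.start_time"), col("t2.end_time"))

    // One partition per table1 row, newest table2 update first.
    val perEvent = Window
      .partitionBy(col("t1.fake_id"), col("t1.attribute_1"), col("t1.event_time"))
      .orderBy(col("t2.updated_time").desc)

    t1.join(t2, joinCond, "left")
      .withColumn("rn", row_number().over(perEvent))
      .where(col("rn") === 1)
      .select(
        col("t1.fake_id"),
        col("t1.attribute_1"),
        col("t1.event_time"),
        col("t2.start_time"),
        col("t2.end_time"))
      .orderBy("fake_id")
  }
}
```

Assuming table1 and table2 are available as tables or temp views, a call such as LatestMatchingInterval.latestMatch(spark.table("table1"), spark.table("table2")).show() would run it.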
