Joining tables on a date range in Hive

Question

I need to join tableA to tableB on employee_id, and cal_date from tableA needs to fall between date_start and date_end from tableB. I ran the query below and received the error message below. Could you please help me correct the query? Thank you for your help!

Both left and right aliases encountered in JOIN 'date_start'

select a.*, b.skill_group 
from tableA a 
  left join tableB b 
    on a.employee_id= b.employee_id 
    and a.cal_date >= b.date_start 
    and a.cal_date <= b.date_end

Answer 1

RTFM - quoting LanguageManual Joins:

Hive does not support join conditions that are not equality conditions as it is very difficult to express such conditions as a map/reduce job.

You may try moving the BETWEEN filter to a WHERE clause, which results in a lousy partially-cartesian join followed by a post-processing cleanup. Yuck. Depending on the actual cardinality of your "skill group" table, it may run fast - or take whole days.
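As a rough sketch of that workaround, assuming the tableA and tableB columns from the question, the equality test stays in the ON clause and the range test moves to WHERE. Note that this behaves as an inner join: tableA rows with no covering range in tableB drop out, so the left-join behavior of the original query would need a separate step to restore.

```sql
-- Equality condition stays in ON; the date-range check moves to WHERE.
-- Every (tableA row, tableB row) pair for the same employee is produced
-- first, then filtered, which is the "partially-cartesian" cost.
select a.*, b.skill_group
from tableA a
  join tableB b
    on a.employee_id = b.employee_id
where a.cal_date >= b.date_start
  and a.cal_date <= b.date_end;
```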

Answer 2

If your situation allows, do it in two queries.

First, a full (comma-style) join, which can carry the range filter in its WHERE clause. Then an outer join that matches on all the key columns, with a WHERE clause that keeps only the rows where one of the joined fields is null, i.e. the tableA rows the first query did not match.

Ex:

-- Step 1: tableA rows that fall inside a tableB date range
create table tableC as
select a.*, b.skill_group 
    from tableA a 
    ,    tableB b 
    where a.employee_id= b.employee_id 
      and a.cal_date >= b.date_start 
      and a.cal_date <= b.date_end;

-- Step 2: append the tableA rows that matched no range, with a null skill_group
with c as (select * from tableC)
insert into tableC
select a.*, cast(null as string) as skill_group
from tableA a 
  left join c
    on (a.employee_id= c.employee_id 
    and a.cal_date  = c.cal_date)
where c.employee_id is null ;

Answer 3

MarkWusinich had a great solution, but with one major issue. If table A has an employee_id twice within the date range, table C will also have that employee_id twice (if B is unique; more if not), creating 4 records after the join. As such, if A is not unique on employee_id, a group by will be necessary. Corrected below:

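-- c holds one row per (employee_id, cal_date, skill_group) that falls in a range
with c as
(select a.employee_id, a.cal_date, b.skill_group 
    from tableA a 
    ,    tableB b 
    where a.employee_id= b.employee_id 
      and a.cal_date >= b.date_start 
      and a.cal_date <= b.date_end
group by a.employee_id, a.cal_date, b.skill_group
)
-- every tableA row is kept; skill_group is null where no range matched
select a.*, c.skill_group
from tableA a 
left join c
  on a.employee_id = c.employee_id 
    and a.cal_date  = c.cal_date;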

Please note: If B was somehow intentionally not distinct on (employee_id, skill_group), then my query above would also have to be modified to appropriately reflect that.

3 answers

Answer 1 (score 3, accepted): 2016-03-12 02:12:29

Answer 2 (score 2): 2018-11-12 22:52:13

Answer 3 (score 0): 2020-11-25 15:17:17
