Combining two queries where one uses GROUP BY
I have two tables. TABLE1 has the following columns:
pers_key
cost
visit
TABLE2 has the following columns:
pers_key
months
First, I create a temporary table:
CREATE TABLE temp_table as
SELECT pers_key,SUM(cost) AS sum_cost, COUNT(DISTINCT visit) AS visit_count
FROM TABLE1
GROUP BY pers_key;
Then I create TABLE3:
CREATE TABLE TABLE3 as
SELECT A.pers_key,
B.sum_cost/A.months AS ind1,
B.visit_count/A.months AS ind2
FROM TABLE2 AS A, temp_table AS B
WHERE A.pers_key = B.pers_key
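I would like to know whether there is a better way to get the same result. Is it possible to do this in a single query, without creating temp_table at all? Perhaps something like this: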
CREATE TABLE TABLE3 as
SELECT A.pers_key,
(SUM(B.cost)over (partition by B.pers_key))/A.months AS ind1,
(COUNT(B.visit)over (partition by B.pers_key))/A.months AS ind2
FROM TABLE2 AS A, TABLE1 AS B
WHERE A.pers_key = B.pers_key
Or is the temporary table required to get the desired result set?
How about just using a subquery?
SELECT A.pers_key,
B.sum_cost / A.months AS ind1,
B.visit_count / A.months AS ind2
FROM TABLE2 A JOIN
(SELECT pers_key, SUM(cost) AS sum_cost,
COUNT(DISTINCT visit) AS visit_count
FROM TABLE1
GROUP BY pers_key
) B
ON A.pers_key = B.pers_key;
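To make the difference from the window-function attempt in the question concrete, here is a minimal sketch that runs both queries against an in-memory SQLite database through Python's `sqlite3` module. The sample rows, the column types, and the choice of SQLite are assumptions made purely for illustration (the question does not say which database is in use), and an `ORDER BY` is added only so the output is deterministic. `months` is stored as a real number here so that the division is not truncated to an integer.

```python
import sqlite3

# In-memory SQLite database with made-up sample rows (illustrative only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TABLE1 (pers_key INTEGER, cost REAL, visit INTEGER);
    CREATE TABLE TABLE2 (pers_key INTEGER, months REAL);
    INSERT INTO TABLE1 VALUES (1, 10.0, 100), (1, 20.0, 100), (1, 30.0, 101),
                              (2, 8.0, 200);
    INSERT INTO TABLE2 VALUES (1, 2.0), (2, 4.0);
""")

# Single query: aggregate TABLE1 in a derived table, then join to TABLE2.
subquery_sql = """
    SELECT A.pers_key,
           B.sum_cost / A.months    AS ind1,
           B.visit_count / A.months AS ind2
    FROM TABLE2 A
    JOIN (SELECT pers_key,
                 SUM(cost)             AS sum_cost,
                 COUNT(DISTINCT visit) AS visit_count
          FROM TABLE1
          GROUP BY pers_key) B
      ON A.pers_key = B.pers_key
    ORDER BY A.pers_key
"""
subquery_rows = conn.execute(subquery_sql).fetchall()

# Window-function attempt from the question, with only an ORDER BY added.
window_sql = """
    SELECT A.pers_key,
           (SUM(B.cost)    OVER (PARTITION BY B.pers_key)) / A.months AS ind1,
           (COUNT(B.visit) OVER (PARTITION BY B.pers_key)) / A.months AS ind2
    FROM TABLE2 AS A, TABLE1 AS B
    WHERE A.pers_key = B.pers_key
    ORDER BY A.pers_key
"""
window_rows = conn.execute(window_sql).fetchall()

print("subquery:", subquery_rows)  # one row per pers_key
print("window:  ", window_rows)    # one row per TABLE1 row
```

With this sample data the subquery version returns one row per person, while the window version keeps one row for every row of TABLE1, and its `COUNT(B.visit)` counts every non-NULL visit rather than the distinct ones. So the window version is not a drop-in replacement for the temp table.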
EDIT:
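Your question is a little complicated. The subquery is definitely a reasonable approach. It is possible that putting the subquery's result in a table and building an index on that table for the join would be faster. However, the red flag is the count(distinct). In my experience with Hive, the following is faster than the subquery above: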
(SELECT pers_key, SUM(sum_cost) AS sum_cost,
COUNT(visit) AS visit_count
FROM (SELECT pers_key, visit, SUM(cost) as sum_cost
FROM TABLE1
GROUP BY pers_key, visit
) t
GROUP BY pers_key
) B
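It is a little counterintuitive (to me) that this version is faster. What happens is that Hive readily parallelizes group bys, whereas count(distinct) is processed serially. This sometimes happens in other databases too (I have seen similar behavior with count(distinct) in Postgres). One more caveat: I did not set up the Hive system where I discovered this, so it may be some sort of configuration issue.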
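As a sanity check that the two-level group by really computes the same thing as count(distinct), here is a small sketch that runs both forms on made-up rows in an in-memory SQLite database via Python's `sqlite3`. The data, the column types, and the use of SQLite are assumptions for illustration. It checks only that the results match, not that either form is faster; timing depends on the engine and has to be measured on the actual system.

```python
import sqlite3

# In-memory SQLite database with made-up rows, including a NULL visit
# to show that both forms ignore NULLs when counting visits.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TABLE1 (pers_key INTEGER, cost REAL, visit INTEGER);
    INSERT INTO TABLE1 VALUES (1, 10.0, 100), (1, 20.0, 100), (1, 30.0, 101),
                              (2, 8.0, 200), (2, 2.0, NULL);
""")

# Original form: a single GROUP BY with COUNT(DISTINCT ...).
distinct_sql = """
    SELECT pers_key, SUM(cost) AS sum_cost, COUNT(DISTINCT visit) AS visit_count
    FROM TABLE1
    GROUP BY pers_key
    ORDER BY pers_key
"""

# Rewritten form: first collapse to one row per (pers_key, visit),
# then count those rows per pers_key.
two_level_sql = """
    SELECT pers_key, SUM(sum_cost) AS sum_cost, COUNT(visit) AS visit_count
    FROM (SELECT pers_key, visit, SUM(cost) AS sum_cost
          FROM TABLE1
          GROUP BY pers_key, visit) t
    GROUP BY pers_key
    ORDER BY pers_key
"""

distinct_rows = conn.execute(distinct_sql).fetchall()
two_level_rows = conn.execute(two_level_sql).fetchall()

print(distinct_rows)
print(two_level_rows)
```

The inner query produces one row per (pers_key, visit) pair, so counting the non-NULL `visit` values in the outer query is the same as counting distinct visits, and summing the partial sums gives the same total cost.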