Combining two queries where one uses GROUP BY

I have two tables. TABLE1 has the following columns:

pers_key
cost
visit

TABLE2 has the following columns:

pers_key
months

First, I create a temporary table:

CREATE TABLE temp_table as
SELECT pers_key,SUM(cost) AS sum_cost, COUNT(DISTINCT visit) AS visit_count
FROM TABLE1
GROUP BY pers_key;

Then I create TABLE3:

CREATE TABLE TABLE3 as
SELECT A.pers_key,
B.sum_cost/A.months AS ind1,
B.visit_count/A.months AS ind2
FROM TABLE2 AS A, temp_table AS B
WHERE A.pers_key = B.pers_key

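I am wondering whether there is a better way to reach the same result. Is it possible to do this in a single query, without creating temp_table at all? Perhaps something like this: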

CREATE TABLE TABLE3 as
SELECT A.pers_key,
(SUM(B.cost)over (partition by B.pers_key))/A.months AS ind1,
(COUNT(B.visit)over (partition by B.pers_key))/A.months AS ind2
FROM TABLE2 AS A, TABLE1 AS B
WHERE A.pers_key = B.pers_key

Or is the temporary table required to get the desired result set?
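
One way to check whether the window-function version matches the temp_table version is to run it on a handful of rows. The sketch below uses made-up sample data (the values and column types are assumptions, not taken from the real tables) and should run on an engine with window-function support, such as PostgreSQL or SQLite 3.25+.

```sql
-- Hypothetical sample data; types and values are assumptions for illustration.
DROP TABLE IF EXISTS TABLE1;
DROP TABLE IF EXISTS TABLE2;
CREATE TABLE TABLE1 (pers_key INT, cost DECIMAL(10,2), visit VARCHAR(10));
CREATE TABLE TABLE2 (pers_key INT, months INT);

INSERT INTO TABLE1 VALUES (1, 10, 'a'), (1, 20, 'a'), (1, 30, 'b'), (2, 5, 'c');
INSERT INTO TABLE2 VALUES (1, 3), (2, 1);

-- Window functions do not collapse rows, so this returns one row per
-- TABLE1 row: three rows for pers_key 1 and one row for pers_key 2.
-- COUNT(B.visit) also counts all three visits of pers_key 1, not the
-- two distinct ones that COUNT(DISTINCT visit) would report.
SELECT A.pers_key,
       (SUM(B.cost)    OVER (PARTITION BY B.pers_key)) / A.months AS ind1,
       (COUNT(B.visit) OVER (PARTITION BY B.pers_key)) / A.months AS ind2
FROM TABLE2 AS A, TABLE1 AS B
WHERE A.pers_key = B.pers_key;
```

On this data the windowed query therefore differs from the temp_table approach in two ways: the number of rows returned and the visit count. Adding DISTINCT to the SELECT would remove the duplicate rows, but the visit count would still not be a distinct count.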

How about just using a subquery?

SELECT A.pers_key,
       B.sum_cost / A.months AS ind1,
       B.visit_count / A.months AS ind2
FROM TABLE2 A JOIN
     (SELECT pers_key, SUM(cost) AS sum_cost,
             COUNT(DISTINCT visit) AS visit_count
      FROM TABLE1
      GROUP BY pers_key
     ) B
     ON A.pers_key = B.pers_key;
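
To make the subquery version concrete, here is a small self-contained sketch on made-up sample data (the rows and column types are assumptions). The `* 1.0` factor is an addition that is not in the query above: in engines where dividing two integers yields an integer (PostgreSQL and SQLite, for example), `visit_count / months` would otherwise be truncated.

```sql
-- Hypothetical sample data; types and values are assumptions for illustration.
DROP TABLE IF EXISTS TABLE1;
DROP TABLE IF EXISTS TABLE2;
CREATE TABLE TABLE1 (pers_key INT, cost DECIMAL(10,2), visit VARCHAR(10));
CREATE TABLE TABLE2 (pers_key INT, months INT);

INSERT INTO TABLE1 VALUES (1, 10, 'a'), (1, 20, 'a'), (1, 30, 'b'), (2, 5, 'c');
INSERT INTO TABLE2 VALUES (1, 3), (2, 1);

-- One row per pers_key: the derived table B is aggregated before the join,
-- so the join cannot multiply rows.
SELECT A.pers_key,
       B.sum_cost    * 1.0 / A.months AS ind1,  -- pers_key 1: 60 / 3
       B.visit_count * 1.0 / A.months AS ind2   -- pers_key 1: 2 / 3
FROM TABLE2 A JOIN
     (SELECT pers_key, SUM(cost) AS sum_cost,
             COUNT(DISTINCT visit) AS visit_count
      FROM TABLE1
      GROUP BY pers_key
     ) B
     ON A.pers_key = B.pers_key
ORDER BY A.pers_key;
```

Wrapping this SELECT in `CREATE TABLE TABLE3 AS ...` gives the same table as the two-step version, without the intermediate temp_table.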

EDIT:

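Your question is a bit complicated. Your approach is definitely reasonable. Putting the subquery in a table and building an index on that table for the join may be faster. However, the red flag is the count(distinct). In my experience with Hive, the following is faster than the subquery above: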

     (SELECT pers_key, SUM(sum_cost) AS sum_cost,
             COUNT(visit) AS visit_count
      FROM (SELECT pers_key, visit, SUM(cost) as sum_cost
            FROM TABLE1
            GROUP BY pers_key, visit
           ) t
      GROUP BY pers_key
     ) B

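It is a bit counterintuitive (to me) that this version is faster. What happens is that Hive readily parallelizes group bys, whereas count(distinct) is processed serially. This sometimes happens in other databases too (I have seen similar behavior with count(distinct) in Postgres). One more caveat: I did not set up the Hive system where I discovered this, so it may be some sort of configuration issue.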
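
As a sanity check that the two-level GROUP BY is a drop-in replacement for COUNT(DISTINCT), the sketch below runs both forms on made-up sample data (the rows and column types are assumptions), including a NULL visit, which both forms leave out of the count. It says nothing about speed; that would have to be measured on the actual system.

```sql
-- Hypothetical sample data; types and values are assumptions for illustration.
DROP TABLE IF EXISTS TABLE1;
CREATE TABLE TABLE1 (pers_key INT, cost DECIMAL(10,2), visit VARCHAR(10));

INSERT INTO TABLE1 VALUES
  (1, 10, 'a'), (1, 20, 'a'), (1, 30, 'b'),
  (2, 5, 'c'),  (2, 7, NULL);

-- Form 1: COUNT(DISTINCT) in a single GROUP BY.
SELECT pers_key, SUM(cost) AS sum_cost,
       COUNT(DISTINCT visit) AS visit_count
FROM TABLE1
GROUP BY pers_key
ORDER BY pers_key;

-- Form 2: group by (pers_key, visit) first, then count the groups.
-- The NULL visit forms its own inner group; its cost is still summed,
-- but COUNT(visit) skips it, just as COUNT(DISTINCT visit) does.
SELECT pers_key, SUM(sum_cost) AS sum_cost,
       COUNT(visit) AS visit_count
FROM (SELECT pers_key, visit, SUM(cost) AS sum_cost
      FROM TABLE1
      GROUP BY pers_key, visit
     ) t
GROUP BY pers_key
ORDER BY pers_key;
```

On this data both queries give a cost sum of 60 with 2 visits for pers_key 1, and a cost sum of 12 with 1 visit for pers_key 2.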
