SQL (Hive) group-by using null values as a wildcard

Question

I have a table like this:

group   val1   val2   val3

group1   5      .      .
group1   .      2      1
group1   .      .      3
group2   1      4      .
group2   .      .      8
group2   2      .      7

I need to count the occurrences of all possible combinations for each group in Hive, using null values (.) as a wildcard. This would give me results like this:

group   val1   val2   val3  cnt

group1   5      2      1     2
group1   5      2      3     2
group2   1      4      8     2
group2   2      4      8     1
group2   2      4      7     1

I know I can do this by selecting all distinct group-val1 pairs, full joining this with all distinct group-val2 pairs, and full joining this with all distinct group-val3 pairs. This gives me all possible combinations for each group, which I can then inner join with my table, counting cases where a row of my original data is a subset of a combination.

Something like this:

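create table my_results as

with combos as (
    -- `group` is a reserved word in Hive, so it is quoted with backticks.
    -- An explicit column list replaces "select *", which would return
    -- the group column three times under the same name.
    select coalesce(A.`group`, B.`group`, C.`group`) as `group`,
           A.val1, B.val2, C.val3
    -- nulls are wildcards, not values, so they are left out of the combinations
    from (select distinct `group`, val1 from data where val1 is not null) A
    full join (select distinct `group`, val2 from data where val2 is not null) B
        on A.`group` = B.`group`
    full join (select distinct `group`, val3 from data where val3 is not null) C
        on coalesce(A.`group`, B.`group`) = C.`group`
)

select A.`group`, A.val1, A.val2, A.val3, count(*) as cnt
from combos A
inner join data B
    on A.`group` = B.`group`
    -- a null in the data row matches any value in the combination
    and (A.val1 = B.val1 OR B.val1 is null)
    and (A.val2 = B.val2 OR B.val2 is null)
    and (A.val3 = B.val3 OR B.val3 is null)
group by A.`group`, A.val1, A.val2, A.val3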
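
To sanity-check the matching logic on the sample rows, here is a minimal sketch that runs an equivalent query in SQLite through Python's sqlite3 module. SQLite is only a stand-in for Hive here. The column name grp (used instead of group), the function name count_combinations, and the use of plain inner joins over the non-null distinct values are assumptions made for this illustration, not part of the original query.

```python
import sqlite3

# Sample rows from the question; None stands for the "." (null) wildcard.
ROWS = [
    ("group1", 5, None, None),
    ("group1", None, 2, 1),
    ("group1", None, None, 3),
    ("group2", 1, 4, None),
    ("group2", None, None, 8),
    ("group2", 2, None, 7),
]

# Assumption: every group has at least one non-null value per column,
# so inner joins are enough to enumerate the combinations.
QUERY = """
WITH combos AS (
    SELECT x.grp, x.val1, y.val2, z.val3
    FROM (SELECT DISTINCT grp, val1 FROM data WHERE val1 IS NOT NULL) x
    JOIN (SELECT DISTINCT grp, val2 FROM data WHERE val2 IS NOT NULL) y
        ON x.grp = y.grp
    JOIN (SELECT DISTINCT grp, val3 FROM data WHERE val3 IS NOT NULL) z
        ON x.grp = z.grp
)
SELECT c.grp, c.val1, c.val2, c.val3, COUNT(*) AS cnt
FROM combos c
JOIN data d
    ON c.grp = d.grp
    AND (c.val1 = d.val1 OR d.val1 IS NULL)
    AND (c.val2 = d.val2 OR d.val2 IS NULL)
    AND (c.val3 = d.val3 OR d.val3 IS NULL)
GROUP BY c.grp, c.val1, c.val2, c.val3
ORDER BY c.grp, c.val1, c.val2, c.val3
"""


def count_combinations(rows):
    """Load rows into an in-memory table and count wildcard matches per combination."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE data (grp TEXT, val1 INTEGER, val2 INTEGER, val3 INTEGER)"
    )
    conn.executemany("INSERT INTO data VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in count_combinations(ROWS):
        print(row)
```

On the sample data this returns a count of 2 for (group1, 5, 2, 1), (group1, 5, 2, 3) and (group2, 1, 4, 8), and a count of 1 for (group2, 1, 4, 7), (group2, 2, 4, 7) and (group2, 2, 4, 8). The (group2, 1, 4, 7) combination is not shown in the result listing above, but the join logic as written produces it.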

But my dataset is very large (hundreds of millions of rows), and the number of possible combinations I can expect is also very large (tens of thousands). Such a join is just too big.

Is there another way? I wondered if I could use regular expressions, but I don't know where to start.

Answer 1

In your sample data, only the third column has multiple values. So, you could just fill in the one value for the two other columns:

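-- `group` is a reserved word in Hive, so it is quoted with backticks
select `group`,
       -- spread the group's single non-null value across all of its rows
       max(max(val1)) over (partition by `group`) as val1,
       max(max(val2)) over (partition by `group`) as val2,
       val3,
       count(*) as cnt
from data
-- val3 is selected without an aggregate, so it must be grouped on as well
group by `group`, val3;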

Solution 1
0 2019-11-11 23:24:38
