I have a table like this:
group val1 val2 val3
group1 5 . .
group1 . 2 1
group1 . . 3
group2 1 4 .
group2 . . 8
group2 2 . 7
I need to count the occurrences of all possible combinations for each group in Hive, using null values (.) as a wildcard. This would give me results like this:
group val1 val2 val3 cnt
group1 5 2 1 2
group1 5 2 3 2
group2 1 4 8 2
group2 2 4 8 1
group2 2 4 7 1
I know I can do this by selecting all distinct group-val1 pairs, full joining this with all distinct group-val2 pairs, and full joining this with all distinct group-val3 pairs. This gives me all possible combinations for each group, which I can then inner join with my table, counting cases where a row of my original data is a subset of a combination.
Something like this:
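create table my_results as
with combos as (
  -- `group` is a reserved word in Hive, so it is quoted with backticks.
  -- Nulls are filtered out so that a wildcard never becomes a combination value.
  select A.`group`, A.val1, B.val2, C.val3
  from (select distinct `group`, val1 from data where val1 is not null) A
  full join (select distinct `group`, val2 from data where val2 is not null) B
    on A.`group` = B.`group`
  full join (select distinct `group`, val3 from data where val3 is not null) C
    on A.`group` = C.`group`
)
select A.`group`, A.val1, A.val2, A.val3, count(*) as cnt
from combos A
inner join data B
  on A.`group` = B.`group`
  -- a null in the original row matches any value in the combination
  and (A.val1 = B.val1 OR B.val1 is null)
  and (A.val2 = B.val2 OR B.val2 is null)
  and (A.val3 = B.val3 OR B.val3 is null)
group by A.`group`, A.val1, A.val2, A.val3;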
However, my dataset is very large (hundreds of millions of rows), and the number of possible combinations I can expect is also very large (tens of thousands). Such a join is just too big.
Is there another way? I wondered if I could use regular expressions, but I don't know where to start.
If only the third column has multiple values within a group (as in group1 of your sample data), you could just fill in the single value for the other two columns:
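select `group`,
       -- spread each group's single val1 / val2 across all of its rows
       max(max(val1)) over (partition by `group`) as val1,
       max(max(val2)) over (partition by `group`) as val2,
       val3,
       count(*) as cnt
from data
-- val3 is selected unaggregated, so it must appear in the group by
group by `group`, val3;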
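When more than one column has several values in a group, one option that may help is to collapse the data to its distinct patterns before joining, so that the wildcard match runs against the pattern counts instead of every raw row. The following is only a sketch under assumptions: the CTE names patterns and combos and the column n are made up for illustration, nulls are the wildcard, and every group has at least one non-null value in each column (otherwise the inner joins in combos drop that group). How much it saves depends on how many rows share the same pattern.

```sql
-- Sketch only: `patterns`, `combos` and `n` are illustrative names.
with patterns as (
  -- one row per distinct (group, val1, val2, val3) pattern, nulls included
  select `group`, val1, val2, val3, count(*) as n
  from data
  group by `group`, val1, val2, val3
),
combos as (
  -- every combination of the non-null values seen in each group
  select A.`group`, A.val1, B.val2, C.val3
  from (select distinct `group`, val1 from data where val1 is not null) A
  join (select distinct `group`, val2 from data where val2 is not null) B
    on A.`group` = B.`group`
  join (select distinct `group`, val3 from data where val3 is not null) C
    on A.`group` = C.`group`
)
select c.`group`, c.val1, c.val2, c.val3,
       sum(p.n) as cnt              -- add up the rows behind each matching pattern
from combos c
join patterns p
  on c.`group` = p.`group`          -- equality-only join condition
where (p.val1 is null or p.val1 = c.val1)   -- a null in the pattern is a wildcard
  and (p.val2 is null or p.val2 = c.val2)
  and (p.val3 is null or p.val3 = c.val3)
group by c.`group`, c.val1, c.val2, c.val3;
```

The wildcard conditions sit in the where clause rather than in the on clause so that the join itself only uses equality on the group column.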