Efficient query to group by column name in SQL or Hive

Question

Imagine I have a table with 2 columns m_1 and m_2:

I would like to get a table with 3 columns:

m identifies the source column (m_1 or m_2 in my example).
d is the value stored in that column.
count is the number of occurrences of each value, grouped by value and source column.

In the example, the result is:

m   | d  | count
m_1 | 3  | 2
m_1 | 4  | 1
m_1 | 9  | 1
m_2 | 17 | 2
m_2 | 18 | 1
m_2 | 9  | 1

The first line should be read as 'value 3 occurs 2 times in column m_1'.

A naive solution is to run a parameterized query once per column, like this:

for (i in 1 .. 2) 
    SELECT CONCAT('m_', i), m_i, count(*) FROM table GROUP BY m_i

But this approach scans the table once per column. That is a problem because I have 255 m columns and billions of rows.

Would the solution be easier if I used Hive instead of a relational database?

Answer 1

You can write this using union all and group by:

select colname, d, count(*)
from ((select 'm_1' as colname, m1 as d from t) union all
      (select 'm_2' as colname, m2 as d from t) 
     ) m12
group by colname, d;
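
With 255 columns, writing one union all branch per column by hand is tedious, so the query text can be generated. The following is a minimal sketch, not a definitive implementation: the helper name build_union_query, the table name t, the column names m1..mN, and the four sample rows are all assumptions (the rows were chosen only so that the counts match the result in the question). It uses SQLite in memory to check the output, and drops the parentheses around each branch because SQLite does not accept them there.

```python
import sqlite3


def build_union_query(table, n_cols):
    """Build the union all + group by query for columns m1..m<n_cols>.

    Hypothetical helper: the m<i> column naming follows the answer above.
    """
    branches = [
        f"select 'm_{i}' as colname, m{i} as d from {table}"
        for i in range(1, n_cols + 1)
    ]
    inner = "\n      union all\n      ".join(branches)
    return (
        "select colname, d, count(*) as cnt\n"
        f"from ({inner}\n     ) m12\n"
        "group by colname, d\n"
        "order by colname, d"
    )


# Assumed sample rows, picked only to reproduce the counts in the question.
conn = sqlite3.connect(":memory:")
conn.execute("create table t (m1 integer, m2 integer)")
conn.executemany(
    "insert into t values (?, ?)",
    [(3, 17), (3, 17), (4, 18), (9, 9)],
)

query = build_union_query("t", 2)
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Note that, depending on the engine's optimizer, each union all branch may still read the table separately, so this form mainly simplifies the query rather than guaranteeing a single scan.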

Answer 2

In Hive, you can use posexplode(array(m1,m2)) in a lateral view:

select      concat('m_',cast(pe.pos+1 as string))   as m
           ,pe.val                                  as d
           ,count(*)                                as `count` 

from        mytable t 
            lateral view posexplode(array(m1,m2)) pe 

group by    pos
           ,val
;

+------+-----+--------+
|  m   |  d  | count  |
+------+-----+--------+
| m_1  | 3   | 2      |
| m_1  | 4   | 1      |
| m_1  | 9   | 1      |
| m_2  | 9   | 1      |
| m_2  | 17  | 2      |
| m_2  | 18  | 1      |
+------+-----+--------+
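
To make the posexplode approach concrete, here is a small sketch that mimics it outside Hive: each row is turned into (position, value) pairs and counted in a single pass over the data. The function names count_by_column and build_posexplode_query, the column names m1..mN, and the sample rows are assumptions made for illustration (the rows are chosen only to match the counts shown above). The generated Hive text is the query above with the array argument spelled out for N columns and explicit `as pos, val` aliases; it has not been run against Hive here.

```python
from collections import Counter


def count_by_column(rows):
    """Count (column position, value) pairs in one pass over the rows."""
    counts = Counter()
    for row in rows:
        # Plays the role of posexplode(array(m1, m2)):
        # one (pos, val) pair per column of the row.
        for pos, val in enumerate(row):
            counts[(pos, val)] += 1
    # Mirror concat('m_', pos + 1) and sort for a stable result order.
    return sorted((f"m_{pos + 1}", val, n) for (pos, val), n in counts.items())


def build_posexplode_query(table, n_cols):
    """Generate the Hive query text for columns m1..m<n_cols> (hypothetical helper)."""
    cols = ",".join(f"m{i}" for i in range(1, n_cols + 1))
    return (
        "select concat('m_', cast(pe.pos + 1 as string)) as m,\n"
        "       pe.val as d,\n"
        "       count(*) as `count`\n"
        f"from {table} t\n"
        f"     lateral view posexplode(array({cols})) pe as pos, val\n"
        "group by pe.pos, pe.val"
    )


# Assumed sample rows, picked only to reproduce the counts shown above.
sample = [(3, 17), (3, 17), (4, 18), (9, 9)]
result = count_by_column(sample)
for line in result:
    print(line)
```

For the 255-column case, build_posexplode_query("mytable", 255) would produce the full array(m1,...,m255) argument without typing it by hand.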

2 answers

solution1
2 2017-09-18 14:00:31

solution2
0 2017-09-18 19:13:46
