
How to get distinct count over multiple columns in Hive SQL?

I have a table that looks like this, and I want to get the distinct count across the three columns for each row.

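+----+---------+---------+---------+
| ID | Column1 | Column2 | Column3 |
+----+---------+---------+---------+
| 1  | A       | B       | C       |
| 2  | A       | A       | B       |
| 3  | A       | A       |         |
+----+---------+---------+---------+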

The desired output I'm looking for is:

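+----+---------+---------+---------+--------------+
| ID | Column1 | Column2 | Column3 | unique_count |
+----+---------+---------+---------+--------------+
| 1  | A       | B       | C       |            3 |
| 2  | A       | A       | B       |            2 |
| 3  | A       | A       |         |            1 |
+----+---------+---------+---------+--------------+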
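
-- count the last column once, then each earlier column whose value does not reappear to its right
1 +
  case when Column1 not in (Column2, Column3) then 1 else 0 end +
  case when Column2 not in (Column3)          then 1 else 0 end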

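This will not work if you intend to count NULLs. More generally, NULLs can cause an undercount: a NOT IN test that involves a NULL evaluates to NULL rather than true, so the CASE falls through to 0. The pattern extends to more columns by successively comparing each one to all columns to its right. The order of the columns doesn't strictly matter; comparing only to the right just avoids repeating the same test over and over.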
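
As a sketch of how the pattern might extend, assume a hypothetical fourth column named Column4 (it is not in the original table) and no NULLs in any column:

```sql
-- Column4 is a hypothetical extra column, used only for illustration.
-- The last column always counts once; each earlier column counts only
-- if its value does not reappear in any column to its right.
1 +
  case when Column1 not in (Column2, Column3, Column4) then 1 else 0 end +
  case when Column2 not in (Column3, Column4)          then 1 else 0 end +
  case when Column3 not in (Column4)                   then 1 else 0 end
```

For a row holding (A, B, A, B), the first two tests fail because both values reappear later, the third passes, and the expression yields 2.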

If the values in each row were alphabetically ordered, then you could test only adjacent pairs to look for differences. That happens to hold for your limited sample, but it is not the general case.
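
A minimal sketch of that adjacent-pair shortcut, under the assumption that every row already satisfies Column1 <= Column2 <= Column3 and that no column is NULL:

```sql
-- Valid only if each row's values are already in sorted order
-- (Column1 <= Column2 <= Column3) and no column is NULL.
-- The first value counts once; each change between neighbours adds one.
1 +
  case when Column1 <> Column2 then 1 else 0 end +
  case when Column2 <> Column3 then 1 else 0 end
```

Because equal values sit next to each other in sorted order, each change between neighbours marks a new distinct value. For (A, A, B) this gives 1 + 0 + 1 = 2.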

One possible option would be to explode the three columns into rows and count the distinct values per original row:

WITH sample AS (
  SELECT 'A' Column1, 'B' Column2, 'C' Column3 UNION ALL
  SELECT 'A', 'A', 'B' UNION ALL
  SELECT 'A', 'A', NULL
)
-- rn keeps identical input rows apart; COUNT(DISTINCT c) ignores the NULLs produced by EXPLODE
SELECT Column1, Column2, Column3, COUNT(DISTINCT c) unique_count
  FROM (SELECT *, ROW_NUMBER() OVER () rn FROM sample) t
       LATERAL VIEW EXPLODE(ARRAY(Column1, Column2, Column3)) tf AS c
 GROUP BY Column1, Column2, Column3, rn;

Output:
+---------+---------+---------+--------------+
| column1 | column2 | column3 | unique_count |
+---------+---------+---------+--------------+
| A       | A       | NULL    |            1 |
| A       | A       | B       |            2 |
| A       | B       | C       |            3 |
+---------+---------+---------+--------------+
