We have an ETL process that pulls data from multiple Hive tables. It reads the tables, builds a DataFrame, and then calls dropDuplicates to remove duplicate rows. After the process finishes, I want to replicate that deduplication logic in Hive for a reporting step. Yes, this adds overhead because we now maintain two separate code bases, but the main ETL process cannot collect these counts itself without slowing down too much. We need the information for reporting and will compute it later in the batch.

What I need is a SQL statement that reads the Hive table, determines which key values are duplicated, and returns their counts. For example, if a particular key value has 9 records in the table, the deduped count for that value would be 8 (9 - 1, since we always keep the parent record). The statement should then run through the whole table and collect those counts.
-- create a temp table with example values
CREATE TEMPORARY TABLE t1 (c1 string, c2 string, c3 string);
INSERT INTO TABLE t1
VALUES('a','b','c'),('a','g','c'),('a','b','c'),('b','a','c'),
('c','a','b'),('a','b','c'),('a','g','c'),('e','b','a');
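-- count duplicates per key (one row of each group is kept, hence the -1)
-- COUNT(*) is used so that groups where c1 is NULL are still counted
SELECT
    c1,
    c2,
    c3,
    COUNT(*) - 1 AS dup_count
FROM t1
GROUP BY
    c1,
    c2,
    c3;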
If you only want to output rows with duplicates:
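-- keep only the keys that actually have duplicates
SELECT *
FROM (
    SELECT
        c1,
        c2,
        c3,
        COUNT(*) - 1 AS dup_count  -- COUNT(*) also counts rows where c1 is NULL
    FROM t1
    GROUP BY
        c1,
        c2,
        c3
) dups
WHERE dup_count > 0;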
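If the report only needs a single figure for how many rows the dedup step would drop, the per-key counts can be summed. This is a minimal sketch, assuming the same example table t1 and that all three columns together form the dedup key; the alias total_dups is just an illustrative name:

```sql
-- total number of rows that would be removed across all keys
SELECT SUM(dup_count) AS total_dups
FROM (
    SELECT COUNT(*) - 1 AS dup_count  -- extra copies within each key group
    FROM t1
    GROUP BY c1, c2, c3
) dups;
```

For the eight example rows above this should return 3: ('a','b','c') appears three times and ('a','g','c') twice, so 2 + 1 rows would be dropped.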