Dropping duplicates from Hive table, need to write out the dropped records and grab count

We have an ETL process that pulls data from multiple Hive tables. It reads the tables, creates a DataFrame, and then uses the dropDuplicates function to remove duplicates. After the process is done, I want to replicate that deduplication in Hive for a reporting process. Yes, the overhead is a mess because we now have two separate code bases, but the main ETL process cannot include this functionality because it would slow it down too much. We need this information for reporting and will produce it later in the batch.

I need a SQL statement that reads the Hive table, determines which key values are duplicated, and grabs their counts. If, say, a particular key value has 9 records in the table, the deduped count for that value would be 8 (9 - 1, since we always keep the parent record). The statement should then run through the whole table and grab those counts.

-- create a temp table with example values

CREATE TEMPORARY TABLE t1 (c1 string, c2 string, c3 string);
INSERT INTO TABLE t1
    VALUES('a','b','c'),('a','g','c'),('a','b','c'),('b','a','c'),
          ('c','a','b'),('a','b','c'),('a','g','c'),('e','b','a');

-- count duplicates

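SELECT
        c1,
        c2,
        c3,
        -- COUNT(*) counts every row in the group, even when c1 is NULL;
        -- subtract 1 for the record that is kept
        COUNT(*) - 1 AS dup_count
FROM t1
GROUP BY
        c1,
        c2,
        c3;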
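
If a single number for the whole table is what the report needs, the per-group counts can be summed. This is a minimal sketch against the example table t1 above; the alias total_dropped is only illustrative.

```sql
-- total number of records a full-row dedup would drop from t1
SELECT SUM(dup_count) AS total_dropped
FROM (
    SELECT COUNT(*) - 1 AS dup_count   -- rows beyond the first in each group
    FROM t1
    GROUP BY c1, c2, c3
) dups;
```

For the eight example rows this returns 3: two extra copies of ('a','b','c') and one extra copy of ('a','g','c').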

If you only want to output rows with duplicates:

SELECT *
FROM (
        SELECT
                c1,
                c2,
                c3,
                -- rows beyond the first in each group
                COUNT(*) - 1 AS dup_count
        FROM t1
        GROUP BY
                c1,
                c2,
                c3
) dups
WHERE dup_count > 0;
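
The question also asks to write out the dropped records themselves. One possible approach, sketched here as an assumption rather than as part of the original answer, is to number the rows within each duplicate group with ROW_NUMBER() and keep everything after the first. The table name t1_dropped is hypothetical.

```sql
-- write every record beyond the first of each duplicate group to a new table
CREATE TABLE t1_dropped AS
SELECT c1, c2, c3
FROM (
    SELECT
        c1,
        c2,
        c3,
        -- rn = 1 is treated as the kept record, rn > 1 as the dropped ones
        ROW_NUMBER() OVER (PARTITION BY c1, c2, c3 ORDER BY c1) AS rn
    FROM t1
) ranked
WHERE rn > 1;
```

With the example data this writes three rows: ('a','b','c') twice and ('a','g','c') once. If the ETL dedups on a subset of columns, put only those key columns in PARTITION BY and choose an ORDER BY that identifies the parent record. Spark's dropDuplicates does not promise which row of a group it keeps, so the rows flagged here may not match the ETL's choice exactly.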
