SQL - Remove Duplicates based on 2 values

Question

I have two PostgreSQL databases with the same table structure. For reporting purposes, I'm pushing the data from these tables into a single Google BigQuery table.

The reporting table has a column called DatabaseID that indicates the source database.

DatabaseID = 1 (1st PostgreSQL table)
DatabaseID = 2 (2nd PostgreSQL table)

Each load appends the incremental data to the reporting table, so it ends up with duplicates for both data sources.

Example data on Reporting table:

id  name    DatabaseID  updated_date
1   aaa         1        2020-12-01
2   ccc         1        2020-12-01
1   vvv         1        2021-01-05
1   qqq         2        2020-12-01
2   www         2        2020-12-01
2   aaa         2        2021-01-05
3   xxx         2        2020-12-01

I need to deduplicate this data within each DatabaseID, keeping only the most recently updated row for each id. I'm not sure about the SQL logic for this.

Expected output after deduplication:

id  name    DatabaseID  updated_date
2   ccc         1        2020-12-01
1   vvv         1        2021-01-05
1   qqq         2        2020-12-01
2   aaa         2        2021-01-05
3   xxx         2        2020-12-01

Answer 1

Could you please try something like this:

WITH CTE(ID, NAME, DATABASEID, UPDATED_DATE) AS
(
    -- Sample data from the question
    SELECT 1, 'AAA', 1, '2020-12-01'
    UNION ALL
    SELECT 2, 'CCC', 1, '2020-12-01'
    UNION ALL
    SELECT 1, 'VVV', 1, '2021-01-05'
    UNION ALL
    SELECT 1, 'QQQ', 2, '2020-12-01'
    UNION ALL
    SELECT 2, 'WWW', 2, '2020-12-01'
    UNION ALL
    SELECT 2, 'AAA', 2, '2021-01-05'
    UNION ALL
    SELECT 3, 'XXX', 2, '2020-12-01'
)
SELECT X.ID, X.NAME, X.DATABASEID, X.UPDATED_DATE
FROM
(
    SELECT C.ID, C.NAME, C.DATABASEID, C.UPDATED_DATE,
           -- Number the rows within each (ID, DATABASEID) pair, newest first
           ROW_NUMBER() OVER (PARTITION BY C.ID, C.DATABASEID ORDER BY C.UPDATED_DATE DESC) AS XCOL
    FROM CTE AS C
) X
WHERE X.XCOL = 1;  -- keep only the newest row of each pair
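
To sanity-check the ROW_NUMBER approach without touching BigQuery, here is a minimal sketch that runs the same partition-and-rank query against the sample data in an in-memory SQLite database. It assumes Python's bundled SQLite is 3.25 or newer (needed for window functions). The table name reporting, the helper deduplicate, and the final ORDER BY are my own additions for the illustration, and SQLite is only a stand-in for BigQuery here.

```python
import sqlite3

# Sample rows from the question: (id, name, DatabaseID, updated_date).
ROWS = [
    (1, "aaa", 1, "2020-12-01"),
    (2, "ccc", 1, "2020-12-01"),
    (1, "vvv", 1, "2021-01-05"),
    (1, "qqq", 2, "2020-12-01"),
    (2, "www", 2, "2020-12-01"),
    (2, "aaa", 2, "2021-01-05"),
    (3, "xxx", 2, "2020-12-01"),
]

# Same idea as the query above: rank rows per (id, DatabaseID), newest first,
# then keep rank 1. The ORDER BY only makes the output deterministic.
DEDUP_SQL = """
SELECT id, name, DatabaseID, updated_date
FROM (
    SELECT r.*,
           ROW_NUMBER() OVER (
               PARTITION BY id, DatabaseID
               ORDER BY updated_date DESC
           ) AS rn
    FROM reporting AS r
) AS ranked
WHERE rn = 1
ORDER BY DatabaseID, id
"""


def deduplicate(rows):
    """Load rows into a throwaway SQLite table and return the deduplicated rows."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE reporting "
            "(id INTEGER, name TEXT, DatabaseID INTEGER, updated_date TEXT)"
        )
        conn.executemany("INSERT INTO reporting VALUES (?, ?, ?, ?)", rows)
        return conn.execute(DEDUP_SQL).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in deduplicate(ROWS):
        print(row)
```

On the sample data this yields five rows, one per (id, DatabaseID) pair, matching the expected output in the question. The ISO-formatted date strings sort correctly as text, which is why the ORDER BY works here without a date type.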

Answer 2

Consider the option below:

#standardSQL
select as value array_agg(t order by updated_date desc limit 1)[offset(0)]
from `project.dataset.table` t
group by id, DatabaseID

For the sample data in your question, the query above returns the rows shown in your expected output.

Answer 3

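In BigQuery, a simple approach uses aggregation. Group by both id and DatabaseID so that the same id coming from different source databases is kept separately:

select array_agg(r order by updated_date desc limit 1)[ordinal(1)].*
from reporting r
group by id, DatabaseID;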

3 answers

solution1
0 ACCEPTED 2021-01-14 09:12:29

solution2
0 2021-01-14 15:29:20

solution3
-1 2021-01-14 12:43:15