
How do you deduplicate records in a BigQuery table?

We have a script that should run daily at 12 am via a GCP Cloud Function and Cloud Scheduler, sending data to a table in BigQuery. Unfortunately, the cron job used to send the data every minute at 12 am, which means the file would be uploaded 60 times instead of only once. The cron timer was * * 3 * * * instead of 00 3 * * *.

How can we fix the table? Note that the transferred data has since been deleted from the source. So far we have been relying on selecting only the unique values at query time, but the table is getting too large.

Any help would be much appreciated

I have two options for you, plus a comment on how to avoid this in future. I recommend reading and comparing both options before proceeding.

Option One

If this is a one-off fix, I recommend you simply:

  1. navigate to the table (your_dataset.your_table) in the UI
  2. click 'Snapshot' and create a snapshot in case you make a mistake in the next part
  3. run SELECT DISTINCT * FROM your_dataset.your_table in the UI
  4. click 'Save results' and select 'BigQuery table', then save as a new table (e.g. your_dataset.your_table_deduplicated)
  5. navigate back to the old table and click the 'Delete' button, then authorise the deletion
  6. navigate to the new table and click the 'Copy' button, then save it in the location the old table was in before (i.e. call the copy your_dataset.your_table)
  7. delete your_dataset.your_table_deduplicated

This procedure will leave you with a replacement table that has the same schema but no duplicated records. Check that it looks as you expect before you discard your snapshot.
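As an aside, the snapshot in step 2 can also be created with SQL rather than the UI. A minimal sketch, using a hypothetical snapshot name and a seven-day expiry so it cleans itself up:

CREATE SNAPSHOT TABLE your_dataset.your_table_snapshot
CLONE your_dataset.your_table
OPTIONS (
  -- hypothetical retention: the snapshot expires after seven days
  expiration_timestamp = TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
)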

Option Two

A quicker approach, if you're comfortable with it, would be to use the Data Manipulation Language (DML).

There is a DELETE statement, but you'd have to construct an appropriate WHERE clause to only delete the duplicate rows.
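For illustration, here is roughly what that could look like, assuming each record has a unique id column and the duplicate copies differ in an ingested_at timestamp (both column names are hypothetical). Note that if the 60 copies are byte-for-byte identical, no WHERE clause can tell them apart, and you need the rewrite approach below instead:

DELETE FROM your_dataset.your_table t
WHERE t.ingested_at > (
  -- keep the earliest copy of each id, delete the later ones
  SELECT MIN(s.ingested_at)
  FROM your_dataset.your_table s
  WHERE s.id = t.id
)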

There is a simpler approach, which is equivalent to Option One and just requires you to run this query:

CREATE OR REPLACE TABLE your_dataset.your_table AS
SELECT DISTINCT * FROM your_dataset.your_table

Again, you may wish to take a snapshot before running this.
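One caveat: CREATE OR REPLACE TABLE ... AS SELECT derives the new table's definition from the query, so if your table is partitioned or clustered you may need to restate the same spec in the statement. A sketch, assuming a hypothetical partitioning column event_date:

CREATE OR REPLACE TABLE your_dataset.your_table
PARTITION BY event_date  -- must match the table's existing partitioning spec
AS
SELECT DISTINCT * FROM your_dataset.your_table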

The Future

If you have a Cloud Function that sends data to BigQuery on a schedule, then best practice would be for this function to be idempotent (i.e. it doesn't matter how many times you run it: if the input is the same, the output is the same).

A typical pattern would be to add a stage to your function to pre-filter the new records.

Depending on your requirements, this stage could

  • prepare the new records you want to insert, which should have some unique, immutable ID field
  • SELECT some_unique_id FROM your_dataset.your_table -> old_record_ids
  • filter the new records, e.g. in Python: new_records = [record for record in prepared_records if record["id"] not in old_record_ids] (keep old_record_ids in a set so the lookup is fast)
  • upload only the records that don't exist yet
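If you'd rather push the filtering into BigQuery than do it in Python, the same idea can be written as a MERGE: load each batch into a staging table first, then insert only the rows whose id the main table hasn't seen. A sketch, where the staging table name is hypothetical and each batch is assumed to be free of internal duplicates:

MERGE your_dataset.your_table AS target
USING your_dataset.your_table_staging AS source
ON target.id = source.id
WHEN NOT MATCHED THEN
  -- copy the whole staging row; assumes staging and target share a schema
  INSERT ROW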

Either way, this will prevent the sort of issue you encountered here.
