
How do you deduplicate records in a BigQuery table?

We have a script that should run daily at 12 am via a GCP Cloud Function and Cloud Scheduler, sending data to a table in BigQuery. Unfortunately, the cron job used to send the data every minute at 12 am, which means the file would be uploaded 60 times instead of only once. The cron timer was * * 3 * * * instead of 00 3 * * *.

How can we fix the table? Note that the transferred data has since been deleted from the source. So far we have been relying on selecting only the unique values at query time, but the table is getting too large.

Any help would be much appreciated

I have two options for you, plus a comment on how to avoid this in future. I recommend reading and comparing both options before proceeding.

Option One

If this is a one-off fix, I recommend you simply:

  1. navigate to the table (your_dataset.your_table) in the UI
  2. click 'Snapshot' and create a snapshot in case you make a mistake in the next part
  3. run SELECT DISTINCT * FROM your_dataset.your_table in the UI
  4. click 'Save results' and select 'BigQuery table', then save as a new table (e.g. your_dataset.your_table_deduplicated)
  5. navigate back to the old table and click the 'Delete' button, then authorise the deletion
  6. navigate to the new table and click the 'Copy' button, then save it in the location the old table was in before (i.e. call the copy your_dataset.your_table)
  7. delete your_dataset.your_table_deduplicated

This procedure will leave you with a replacement table that has the same schema but no duplicated records. Check that it looks as you expect before you discard your snapshot.
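As an aside, the snapshot in step 2 can also be created with SQL rather than the UI. A minimal sketch, using a hypothetical snapshot name and a seven-day expiry so it cleans itself up:

CREATE SNAPSHOT TABLE your_dataset.your_table_snapshot
CLONE your_dataset.your_table
OPTIONS (
  -- hypothetical retention: the snapshot expires after seven days
  expiration_timestamp = TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
)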

Option Two

A quicker approach, if you're comfortable with it, would be to use the Data Manipulation Language (DML).

There is a DELETE statement, but you'd have to construct an appropriate WHERE clause to only delete the duplicate rows.
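For illustration, here is roughly what that could look like, assuming each record has a unique id column and the duplicate copies differ in an ingested_at timestamp (both column names are hypothetical). Note that if the 60 copies are byte-for-byte identical, no WHERE clause can tell them apart, and you need the rewrite approach below instead:

DELETE FROM your_dataset.your_table t
WHERE t.ingested_at > (
  -- keep the earliest copy of each id, delete the later ones
  SELECT MIN(s.ingested_at)
  FROM your_dataset.your_table s
  WHERE s.id = t.id
)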

There is a simpler approach, which is equivalent to Option One and just requires you to run this query:

CREATE OR REPLACE TABLE your_dataset.your_table AS
SELECT DISTINCT * FROM your_dataset.your_table

Again, you may wish to take a snapshot before running this.
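One caveat: CREATE OR REPLACE TABLE ... AS SELECT derives the new table's definition from the query, so if your table is partitioned or clustered you may need to restate the same spec in the statement. A sketch, assuming a hypothetical partitioning column event_date:

CREATE OR REPLACE TABLE your_dataset.your_table
PARTITION BY event_date  -- must match the table's existing partitioning spec
AS
SELECT DISTINCT * FROM your_dataset.your_table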

The Future

If you have a Cloud Function that sends data to BigQuery on a schedule, then best practice would be for this function to be idempotent (i.e. it doesn't matter how many times you run it: if the input is the same, the output is the same).

A typical pattern would be to add a stage to your function to pre-filter the new records.

Depending on your requirements, this stage could

  • prepare the new records you want to insert, which should have some unique, immutable ID field
  • SELECT some_unique_id FROM your_dataset.your_table -> old_record_ids
  • filter the new records, e.g. in Python: new_records = [record for record in prepared_records if record["id"] not in old_record_ids] (keep old_record_ids in a set so the lookup is fast)
  • upload only the records that don't exist yet
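If you'd rather push the filtering into BigQuery than do it in Python, the same idea can be written as a MERGE: load each batch into a staging table first, then insert only the rows whose id the main table hasn't seen. A sketch, where the staging table name is hypothetical and each batch is assumed to be free of internal duplicates:

MERGE your_dataset.your_table AS target
USING your_dataset.your_table_staging AS source
ON target.id = source.id
WHEN NOT MATCHED THEN
  -- copy the whole staging row; assumes staging and target share a schema
  INSERT ROW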

Either way, this will prevent the sort of issue you encountered here.
