We have a script that should run daily at 12 am on a GCP Cloud Function, triggered by Cloud Scheduler, which sends data to a BigQuery table. Unfortunately, the cron job was sending the data every minute during that hour, which means the file was uploaded 60 times instead of only once. The cron timer was * * 3 * * * instead of 00 3 * * *.
How can we fix the table? Note that the transferred data has since been deleted from the source. So far we rely on selecting the unique values, but the table is getting too large.
Any help would be much appreciated
I have two options for you, plus a comment on how to avoid this in future. I recommend reading and comparing both options before proceeding.
If this is a one-off fix, I recommend you simply:

1. Take a snapshot of your table (your_dataset.your_table) in the UI.
2. Run SELECT DISTINCT * FROM your_dataset.your_table
3. Save the results to a new table (your_dataset.your_table_deduplicated) in the UI.
4. Delete the original table (your_dataset.your_table).
5. Copy the deduplicated table (your_dataset.your_table_deduplicated) back to your_dataset.your_table.
This procedure replaces the current table with one that has the same schema but no duplicated records. You should check that it looks as you expect before you discard your snapshot.
A quicker approach, if you're comfortable with it, would be to use the Data Manipulation Language (DML). There is a DELETE statement, but you would have to construct an appropriate WHERE clause to delete only the duplicate rows.
There is a simpler approach, which is equivalent to option one and just requires you to run this query:
CREATE OR REPLACE TABLE your_dataset.your_table AS
SELECT DISTINCT * FROM your_dataset.your_table
Again, you may wish to take a snapshot before running this.
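For intuition, the effect of SELECT DISTINCT * on a table whose rows were inserted 60 times can be sketched in plain Python (the rows here are made up for illustration; this assumes the 60 uploads produced fully identical rows, since DISTINCT only collapses rows that match in every column):

```python
# Hypothetical rows: each unique row was loaded 60 times by the faulty cron.
rows = [("2023-01-01", "a1", 10)] * 60 + [("2023-01-01", "b2", 20)] * 60

def distinct(rows):
    """Mimic SELECT DISTINCT *: keep the first copy of each identical row."""
    seen, out = set(), []
    for r in rows:
        if r not in seen:
            seen.add(r)
            out.append(r)
    return out

deduped = distinct(rows)
# 120 input rows collapse to the 2 unique rows
```

If the duplicate uploads differ in any column (for example, an ingestion timestamp), DISTINCT will not collapse them and you would need a dedup key instead.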
If you have a cloud function that sends data to BigQuery on a schedule, then best practice would be for this function to be idempotent (i.e. it doesn't matter how many times you run it: if the input is the same, the output is the same).
A typical pattern would be to add a stage to your function that pre-filters the new records. Depending on your requirements, this stage could first fetch the IDs already present in the table, for example:

SELECT some_unique_id FROM your_dataset.your_table
-> old_record_ids

and then keep only the records that have not yet been loaded:

new_records = [record for record in prepared_records if record["id"] not in old_record_ids]
This will prevent the sort of issues you have encountered here.
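The pre-filter stage described above can be sketched as a small self-contained function. The record shape and the names (filter_new_records, prepared_records, the "id" field) are illustrative assumptions; in the real function, existing_ids would come from a query like SELECT some_unique_id FROM your_dataset.your_table rather than a hard-coded list:

```python
def filter_new_records(prepared_records, existing_ids):
    """Keep only records whose unique id is not already in the table."""
    existing = set(existing_ids)  # set membership is O(1) per record
    return [r for r in prepared_records if r["id"] not in existing]

# Simulated inputs, standing in for the query result and the prepared batch:
existing_ids = ["a1", "b2"]
prepared_records = [
    {"id": "a1", "value": 10},  # duplicate: already loaded, dropped
    {"id": "c3", "value": 30},  # new: kept
]
new_records = filter_new_records(prepared_records, existing_ids)
```

Because the filter drops anything already present, re-running the function with the same input inserts nothing the second time, which is exactly the idempotency property that would have prevented the 60x duplication.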