简体   繁体   中英

What is the best practice for loading data into BigQuery table?

Currently I'm loading data from Google Storage to stage_table_orders using WRITE_APPEND . Since this load both new and existed order there could be a case where same order has more than one version the field etl_timestamp tells which row is the most updated one.

then I WRITE_TRUNCATE my production_table_orders with query like:

select ...
from (
    SELECT  * , ROW_NUMBER() OVER
    (PARTITION BY date_purchased, orderid order by etl_timestamp DESC) as rn 
    FROM `warehouse.stage_table_orders` )
where rn=1

Then the production_table_orders always contains the most updated version of each order.

This process is suppose to run every 3 minutes .

I'm wondering if this is the best practice. I have around 20M rows. It seems not smart to WRITE_TRUNCATE 20M rows every 3 minutes.

Suggestion?

We are doing the same. To help improve performance though, try to partition the table by date_purchased and cluster by orderid . Use a CTAS statement (to the table itself) as you cannot add partition after fact.

EDIT: use 2 tables and MERGE

Depending on your particular use case ie the number of fields that could be updated between old and new, you could use 2 tables, eg stage_table_orders for the imported records and final_table_orders as destination table and do a MERGE like so:

MERGE final_table_orders F
USING stage_table_orders S
ON F.orderid = S.orderid AND
   F.date_purchased = S.date_purchased
WHEN MATCHED THEN
  UPDATE SET field_that_change = S.field_that_change
WHEN NOT MATCHED THEN
  INSERT (field1, field2, ...) VALUES(S.field1, S.field2, ...)    

Pro : efficient if few rows are "upserted", not millions (although not tested) + pruning partitions should work.

Con : you have to explicitly list the fields in the update and insert clauses. A one-time effort if schema is pretty much fixed.

There are may ways to de-duplicate and there is no one-size-fits-all. Search in SO for similar requests using ARRAY_AGG , or EXISTS with DELETE or UNION ALL ,... Try them out and see which performs better for YOUR dataset.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM