What is the best practice for loading data into a BigQuery table?
Currently I'm loading data from Google Storage into stage_table_orders using WRITE_APPEND.
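For reference, the same append-mode load can be written as a BigQuery LOAD DATA statement. This is only a sketch, assuming CSV files under a placeholder gs:// path; the actual load may well be a load job submitted through the API or the bq CLI instead:

LOAD DATA INTO `warehouse.stage_table_orders`
FROM FILES (
  format = 'CSV',                          -- assumed format
  uris = ['gs://my-bucket/orders/*.csv']   -- placeholder path
);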
Since this loads both new and existing orders, the same order can appear with more than one version; the field etl_timestamp tells which row is the most recent.
Then I WRITE_TRUNCATE my production_table_orders with a query like:
SELECT ...
FROM (
  SELECT *, ROW_NUMBER() OVER
    (PARTITION BY date_purchased, orderid ORDER BY etl_timestamp DESC) AS rn
  FROM `warehouse.stage_table_orders`
)
WHERE rn = 1
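For what it's worth, that WRITE_TRUNCATE step can also be expressed purely in SQL. A minimal sketch, assuming production_table_orders lives in the same warehouse dataset and using SELECT * EXCEPT(rn) in place of the elided column list:

CREATE OR REPLACE TABLE `warehouse.production_table_orders` AS
SELECT * EXCEPT(rn)   -- drop the helper column
FROM (
  SELECT *, ROW_NUMBER() OVER
    (PARTITION BY date_purchased, orderid ORDER BY etl_timestamp DESC) AS rn
  FROM `warehouse.stage_table_orders`
)
WHERE rn = 1;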
Then production_table_orders always contains the most recent version of each order.
This process is supposed to run every 3 minutes.
I'm wondering if this is the best practice. I have around 20M rows, and it doesn't seem smart to WRITE_TRUNCATE 20M rows every 3 minutes.
Suggestions?
We are doing the same. To help improve performance though, try partitioning the table by date_purchased and clustering by orderid. Use a CTAS statement (to the table itself), as you cannot add partitioning after the fact.
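A minimal sketch of such a CTAS, assuming date_purchased is a DATE column. The _partitioned name is a placeholder: if BigQuery refuses to replace the table in place with a different partitioning spec, write to a new table and swap it in:

CREATE TABLE `warehouse.stage_table_orders_partitioned`
PARTITION BY date_purchased   -- assumes a DATE column
CLUSTER BY orderid
AS
SELECT * FROM `warehouse.stage_table_orders`;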
EDIT: use 2 tables and MERGE
Depending on your particular use case, i.e. the number of fields that could change between the old and new versions, you could use 2 tables, e.g. stage_table_orders for the imported records and final_table_orders as the destination table, and do a MERGE like so:
MERGE final_table_orders F
USING stage_table_orders S
ON F.orderid = S.orderid AND
   F.date_purchased = S.date_purchased
WHEN MATCHED THEN
  UPDATE SET field_that_change = S.field_that_change
WHEN NOT MATCHED THEN
  INSERT (field1, field2, ...) VALUES (S.field1, S.field2, ...)
Pro: efficient if only a few rows are "upserted" rather than millions (although not tested), and partition pruning should work.
Con: you have to explicitly list the fields in the UPDATE and INSERT clauses, a one-time effort if the schema is pretty much fixed.
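One step this approach implies but doesn't show (my assumption, not part of the original answer): the staging table should be emptied after each successful MERGE so the next WRITE_APPEND load starts clean, e.g.:

-- clear staged rows once they have been merged (assumed cleanup step)
TRUNCATE TABLE stage_table_orders;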
There are many ways to de-duplicate and there is no one-size-fits-all. Search SO for similar questions using ARRAY_AGG, or EXISTS with DELETE, or UNION ALL, ... Try them out and see which performs better for YOUR dataset.
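As one example, the ARRAY_AGG pattern keeps the latest row per order by aggregating each group into a one-element array sorted by etl_timestamp and expanding that element back into columns. A sketch against the staging table above:

SELECT latest.*
FROM (
  SELECT ARRAY_AGG(t ORDER BY etl_timestamp DESC LIMIT 1)[OFFSET(0)] AS latest
  FROM `warehouse.stage_table_orders` AS t
  GROUP BY date_purchased, orderid
);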