What is the best practice for loading data into a BigQuery table?
Currently I'm loading data from Google Storage into stage_table_orders using WRITE_APPEND.
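For reference, the same append-mode load can be written as a BigQuery LOAD DATA statement. This is only a sketch, assuming CSV files under a placeholder gs:// path; the actual load may well be a load job submitted through the API or the bq CLI instead:

LOAD DATA INTO `warehouse.stage_table_orders`
FROM FILES (
  format = 'CSV',                          -- assumed format
  uris = ['gs://my-bucket/orders/*.csv']   -- placeholder path
);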
Since this loads both new and existing orders, the same order can appear with more than one version; the field etl_timestamp tells which row is the most recent.
Then I WRITE_TRUNCATE my production_table_orders with a query like:
SELECT ...
FROM (
  SELECT *, ROW_NUMBER() OVER
    (PARTITION BY date_purchased, orderid ORDER BY etl_timestamp DESC) AS rn
  FROM `warehouse.stage_table_orders`
)
WHERE rn = 1
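For what it's worth, that WRITE_TRUNCATE step can also be expressed purely in SQL. A minimal sketch, assuming production_table_orders lives in the same warehouse dataset and using SELECT * EXCEPT(rn) in place of the elided column list:

CREATE OR REPLACE TABLE `warehouse.production_table_orders` AS
SELECT * EXCEPT(rn)   -- drop the helper column
FROM (
  SELECT *, ROW_NUMBER() OVER
    (PARTITION BY date_purchased, orderid ORDER BY etl_timestamp DESC) AS rn
  FROM `warehouse.stage_table_orders`
)
WHERE rn = 1;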
Then production_table_orders always contains the most recent version of each order.
This process is supposed to run every 3 minutes.
I'm wondering if this is the best practice. I have around 20M rows, and it doesn't seem smart to WRITE_TRUNCATE 20M rows every 3 minutes.
Suggestions?
We are doing the same. To help improve performance though, try partitioning the table by date_purchased and clustering by orderid. Use a CTAS statement (to the table itself), as you cannot add partitioning after the fact.
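A minimal sketch of such a CTAS, assuming date_purchased is a DATE column. The _partitioned name is a placeholder: if BigQuery refuses to replace the table in place with a different partitioning spec, write to a new table and swap it in:

CREATE TABLE `warehouse.stage_table_orders_partitioned`
PARTITION BY date_purchased   -- assumes a DATE column
CLUSTER BY orderid
AS
SELECT * FROM `warehouse.stage_table_orders`;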
EDIT: use 2 tables and MERGE
Depending on your particular use case, i.e. the number of fields that could change between the old and new versions, you could use 2 tables, e.g. stage_table_orders for the imported records and final_table_orders as the destination table, and do a MERGE like so:
MERGE final_table_orders F
USING stage_table_orders S
ON F.orderid = S.orderid AND
   F.date_purchased = S.date_purchased
WHEN MATCHED THEN
  UPDATE SET field_that_change = S.field_that_change
WHEN NOT MATCHED THEN
  INSERT (field1, field2, ...) VALUES (S.field1, S.field2, ...)
Pro: efficient if only a few rows are "upserted" rather than millions (although not tested), and partition pruning should work.
Con: you have to explicitly list the fields in the UPDATE and INSERT clauses, a one-time effort if the schema is pretty much fixed.
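One step this approach implies but doesn't show (my assumption, not part of the original answer): the staging table should be emptied after each successful MERGE so the next WRITE_APPEND load starts clean, e.g.:

-- clear staged rows once they have been merged (assumed cleanup step)
TRUNCATE TABLE stage_table_orders;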
There are many ways to de-duplicate and there is no one-size-fits-all. Search SO for similar questions using ARRAY_AGG, or EXISTS with DELETE, or UNION ALL, ... Try them out and see which performs better for YOUR dataset.
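As one example, the ARRAY_AGG pattern keeps the latest row per order by aggregating each group into a one-element array sorted by etl_timestamp and expanding that element back into columns. A sketch against the staging table above:

SELECT latest.*
FROM (
  SELECT ARRAY_AGG(t ORDER BY etl_timestamp DESC LIMIT 1)[OFFSET(0)] AS latest
  FROM `warehouse.stage_table_orders` AS t
  GROUP BY date_purchased, orderid
);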