"composite key" in BigQuery when streaming with insertAll

I'm streaming data into a BigQuery table, building an InsertAllRequest which is then inserted using the insertAll method from com.google.cloud.bigquery.BigQuery. I got it all to work in the sense that I can insert data into the table, but I'm after a specific behavior: I'd like to implement some kind of a "composite key" in the table.

Here's what the table looks like:

Field name      | Type      | Mode
--------------------------------------
order_id        | STRING    | REQUIRED
modified_ts     | TIMESTAMP | REQUIRED
order_sum       | INTEGER   | NULLABLE
order_reference | STRING    | NULLABLE

So, I'd like the key to be order_id and modified_ts; in other words, I'd like to be able to track changes of an order over time. If an existing key is inserted again, I'd hope for some error - or just ignoring this new row (regarding it as a duplicate) would work fine for me as well.

Unfortunately, I haven't yet succeeded in telling BigQuery to do so. Here's the code I tested:

String rowId = String.valueOf("order_id, modified_ts");

InsertAllRequest req = InsertAllRequest.newBuilder(ORDER)
        .addRow(rowId, mapOrder(o, modifiedTs))
        .build();

InsertAllResponse resp = bigQuery.insertAll(req);
log.info("response was: {}", resp.toString());

ORDER in newBuilder is a TableId object and mapOrder(o, modifiedTs) maps the incoming object to a Map<String, Object>. All works fine if I define rowId as String.valueOf("order_id"), but obviously all updates of an order just update the existing row, not generating any history. The solution above with comma-separated column names behaves the same way, simply ignoring modified_ts.

So, my question is simply: how can I get this to work? What I want is - somewhat simplified - the following:

order_id | modified_ts | data
------------------------------------------
    1    | 2020-12-10  | some data
    1    | 2020-12-15  | some changed data
    2    | 2020-12-15  | some more data

The composite key or UNIQUE concept doesn't exist in BigQuery. There are no keys and indexes.

Engineer your app so that it allows duplicates to be inserted.
On top of your table, create a view that reads the most recent row of the record, based on the concept you already laid out.

This way you have access to versioned data as well, and you always have the up-to-date version by using the view in the FROM clause of a query.
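Such a view can be sketched with a window function. All names here are assumptions: the table is taken to live at my_dataset.orders with the schema from the question, and orders_latest is an arbitrary view name.

```sql
-- Hypothetical names: my_dataset.orders is the table from the question.
-- ROW_NUMBER keeps exactly one row per order_id (the most recent one),
-- which also collapses exact duplicates of the same (order_id, modified_ts).
CREATE OR REPLACE VIEW my_dataset.orders_latest AS
SELECT * EXCEPT (rn)
FROM (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY order_id
                            ORDER BY modified_ts DESC) AS rn
  FROM my_dataset.orders
)
WHERE rn = 1;
```

Queries then select from orders_latest instead of the raw table; the full history stays available in my_dataset.orders itself.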

As written in the comment on Pentium 10's answer, a meeting with a Google representative confirmed its content.

Basically, I misunderstood the functionality of adding a "rowId" to my row, indicating its key: String rowId = String.valueOf("order_id, modified_ts"); This is nothing more than what Google calls "best effort de-duplication", and it's just that - a best effort and no guarantee whatsoever. I mistook this for a technique to rely on, my bad.
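To make the misunderstanding concrete: the rowId is an opaque string, not a list of column names, so the comma-separated string above is just one constant id shared by every row. If a composite id is wanted at all, it has to be built from the actual field values. A minimal sketch, with a hypothetical helper (composeRowId is not part of the BigQuery client):

```java
public class RowIdSketch {

    // Hypothetical helper: bake the composite-key *values* (not the column
    // names) into a single insertId string. "|" is an arbitrary separator,
    // chosen so it cannot occur inside an order_id.
    static String composeRowId(String orderId, String modifiedTs) {
        return orderId + "|" + modifiedTs;
    }

    public static void main(String[] args) {
        // Two versions of the same order now produce distinct ids, so
        // best-effort de-duplication no longer collapses the history -
        // but it remains best effort, never a uniqueness guarantee.
        System.out.println(composeRowId("1", "2020-12-10T00:00:00Z"));
        System.out.println(composeRowId("1", "2020-12-15T00:00:00Z"));
    }
}
```

The resulting string would then be passed as the rowId argument to addRow, in place of the comma-separated column list.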

The recommended way to deal with this is in your own code, either before or after streaming into BigQuery. "Before" would mean implementing logic in your app that handles duplicates before writing data into BQ, which includes some way of keeping what you identify as keys in memory. "After" is what Pentium 10 suggests: stream all the data into BigQuery and persist it, then take care of the rest.
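The "before" variant can be sketched like this; all names are illustrative, and a plain in-memory set obviously only works for a single, non-restarting writer process:

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of the "before" approach: remember every composite key streamed so
// far and skip rows whose key has already been seen.
public class BeforeDedup {

    private final Set<String> seenKeys = new HashSet<>();

    /** Returns true if the row should be streamed, false if it's a duplicate. */
    boolean shouldInsert(String orderId, String modifiedTs) {
        // Set.add returns false when the key was already present.
        return seenKeys.add(orderId + "|" + modifiedTs);
    }

    public static void main(String[] args) {
        BeforeDedup dedup = new BeforeDedup();
        System.out.println(dedup.shouldInsert("1", "2020-12-10")); // new key
        System.out.println(dedup.shouldInsert("1", "2020-12-10")); // duplicate
        System.out.println(dedup.shouldInsert("1", "2020-12-15")); // changed order
    }
}
```

A restart loses the set, so a production version would need some persistent state behind it - which is exactly why the "after" approach below is often the simpler one.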

There are three ways to solve this problem "after": views with (the very handy) window functions may be a way (but remember that the processing power of the whole underlying query is needed every time you query the view), materialized views might be a solution (if/when Google supports window functions in those), or you create and update a table with the desired data yourself, managing some kind of scheduling.

I hope this answer helps clear things up a bit and serves as a complement to the provided one.
