
Dataflow streaming template for data masking/tokenization giving inconsistent results

The Google-provided Dataflow streaming template for data masking/tokenization from Cloud Storage to BigQuery using Cloud DLP is giving inconsistent output for each source file.

We have 100-odd files with 1M records each in a GCS bucket, and we are calling the Dataflow streaming template to tokenize the data using DLP and load it into BigQuery.

While loading the files sequentially, we saw that the results are inconsistent.

For a few files the full 1M rows were loaded, but for most of them the loaded row count varied between 0.98M and 0.99M. Is there any reason for such behaviour?

I am not sure, but it may be due to the BigQuery best-effort deduplication mechanism used when streaming data into BigQuery:

From the Beam documentation:

Note: Streaming inserts by default enables BigQuery best-effort deduplication mechanism. You can disable that by setting ignoreInsertIds. The quota limitations are different when deduplication is enabled vs. disabled.

Streaming inserts applies a default sharding for each table destination. You can use withAutoSharding (starting 2.28.0 release) to enable dynamic sharding and the number of shards may be determined and changed at runtime. The sharding behavior depends on the runners.

From the Google Cloud documentation:

Best effort de-duplication: When you supply insertId for an inserted row, BigQuery uses this ID to support best effort de-duplication for up to one minute. That is, if you stream the same row with the same insertId more than once within that time period into the same table, BigQuery might de-duplicate the multiple occurrences of that row, retaining only one of those occurrences.

The system expects that rows provided with identical insertIds are also identical. If two rows have identical insertIds, it is nondeterministic which row BigQuery preserves.

De-duplication is generally meant for retry scenarios in a distributed system where there's no way to determine the state of a streaming insert under certain error conditions, such as network errors between your system and BigQuery or internal errors within BigQuery. If you retry an insert, use the same insertId for the same set of rows so that BigQuery can attempt to de-duplicate your data. For more information, see troubleshooting streaming inserts.

De-duplication offered by BigQuery is best effort, and it should not be relied upon as a mechanism to guarantee the absence of duplicates in your data. Additionally, BigQuery might degrade the quality of best effort de-duplication at any time in order to guarantee higher reliability and availability for your data.

If you have strict de-duplication requirements for your data, Google Cloud Datastore is an alternative service that supports transactions.
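To make the insertId behaviour concrete, here is a minimal sketch (not part of the template) using the BigQuery Java client library: two rows streamed with the same row ID inside the de-duplication window may be collapsed into one. The dataset, table, column names and values are placeholders for illustration only.

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.InsertAllRequest;
import com.google.cloud.bigquery.InsertAllResponse;
import com.google.cloud.bigquery.TableId;
import java.util.Map;

public class InsertIdDedupSketch {
  public static void main(String[] args) {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
    TableId table = TableId.of("my_dataset", "my_table");  // placeholder table

    Map<String, Object> row = Map.of("name", "alice", "score", 42);

    // Both rows carry the same insertId ("row-0001"). If they arrive within
    // BigQuery's ~1 minute best-effort window, only one copy may be retained.
    InsertAllRequest request = InsertAllRequest.newBuilder(table)
        .addRow(InsertAllRequest.RowToInsert.of("row-0001", row))
        .addRow(InsertAllRequest.RowToInsert.of("row-0001", row))
        .build();

    InsertAllResponse response = bigquery.insertAll(request);
    if (response.hasErrors()) {
      System.out.println("Insert errors: " + response.getInsertErrors());
    }
  }
}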

This mechanism can be disabled with ignoreInsertIds.

You can test by disabling this mechanism and check whether all the rows are inserted; a sketch of the corresponding write configuration is shown below.
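As a rough sketch of what disabling insert IDs looks like in a Beam pipeline (illustrative Beam code, not the template's actual source, which would need to be rebuilt to change this; the destination table and sample row are placeholders):

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.TableRowJsonCoder;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.PCollection;

public class IgnoreInsertIdsSketch {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Stand-in for the PCollection<TableRow> that the DLP de-identification
    // step of the template would produce.
    PCollection<TableRow> tokenizedRows = pipeline.apply("SampleRows",
        Create.of(new TableRow().set("name", "alice").set("score", "42"))
            .withCoder(TableRowJsonCoder.of()));

    tokenizedRows.apply("WriteToBigQuery",
        BigQueryIO.writeTableRows()
            .to("my-project:my_dataset.my_table")  // placeholder destination
            .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS)
            // Do not attach per-row insert IDs, so BigQuery's best-effort
            // de-duplication cannot silently drop rows it considers duplicates.
            .ignoreInsertIds()
            // .withAutoSharding()  // Beam 2.28.0+, for unbounded (streaming) inputs
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

    pipeline.run();
  }
}

Note the trade-off: without insert IDs, BigQuery cannot de-duplicate retried streaming inserts, so a retried bundle may produce duplicate rows.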

By adjusting the value of the batch size in the template, all files of 1M records each got loaded successfully.
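For reference, the batch size mentioned here is a launch parameter of the Google-provided template. A hedged launch example follows; the template path and parameter names should be verified against the current template documentation, and the job name, bucket, project, dataset and DLP template values are placeholders. As I understand it, batchSize controls how many records are grouped into each DLP de-identify request.

gcloud dataflow jobs run dlp-tokenization-job \
  --region=us-central1 \
  --gcs-location=gs://dataflow-templates/latest/Stream_DLP_GCS_Text_to_BigQuery \
  --parameters=inputFilePattern=gs://my-bucket/input/*.csv,datasetName=my_dataset,dlpProjectId=my-project,deidentifyTemplateName=projects/my-project/deidentifyTemplates/my-deid-template,batchSize=100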
