
Which is better for Dataflow: BigQueryIO.write() or the bigquery.insertAll() method?

I am developing Java code to read records from GCS and insert them into a BigQuery table. From a cost and performance point of view, which is the better choice: BigQueryIO.write() or the bigquery.insertAll() method?

If you are using Dataflow, your preferred method should be Beam's BigQueryIO: this class encapsulates a lot of knowledge about the best ways to handle errors and the different methods of sending data to BigQuery.

The two methods you can choose from with BigQueryIO.Write:

FILE_LOADS:

Use BigQuery load jobs to insert data. Records will first be written to files, and these files will be loaded into BigQuery. This is the default method when the input is bounded. This method can be chosen for unbounded inputs as well, as long as a triggering frequency is also set using BigQueryIO.Write.withTriggeringFrequency. BigQuery has quotas on the number of load jobs allowed per day, so be careful not to set the triggering frequency too high. For more information, see Loading Data from Cloud Storage.
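As an illustration, a minimal sketch of this configuration might look like the following (assuming a Beam Java pipeline where `rows` is a `PCollection<TableRow>` and `tableSchema` is a `TableSchema` you have already built; the project, dataset, and table names are hypothetical):

```java
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
import org.joda.time.Duration;

// Write via BigQuery load jobs: rows are staged as files and loaded in batches.
// withTriggeringFrequency is only needed for unbounded input; keep it low enough
// to stay within BigQuery's daily load-job quota.
rows.apply("WriteWithLoadJobs",
    BigQueryIO.writeTableRows()
        .to("my-project:my_dataset.my_table")   // hypothetical destination table
        .withSchema(tableSchema)
        .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
        .withTriggeringFrequency(Duration.standardMinutes(10))
        .withNumFileShards(10)                  // required when a triggering frequency is set
        .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
        .withWriteDisposition(WriteDisposition.WRITE_APPEND));
```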

STREAMING_INSERTS:

Use the BigQuery streaming insert API to insert data. This provides the lowest-latency insert path into BigQuery, and is therefore the default method when the input is unbounded. BigQuery will make a strong effort to ensure no duplicates when using this path; however, there are some scenarios in which BigQuery is unable to make this guarantee. A query can be run over the output table to periodically clean these rare duplicates. Alternatively, using the FILE_LOADS insert method does guarantee no duplicates, though the latency for the insert into BigQuery will be much higher. For more information, see Streaming Data into BigQuery.
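By contrast, a sketch of the streaming path could look like this (same assumptions about `rows` and `tableSchema`; the retry policy shown is just one reasonable choice):

```java
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.InsertRetryPolicy;

// Write via the streaming insert API: rows become queryable with low latency,
// but streaming inserts are billed per row and rare duplicates are possible.
rows.apply("WriteWithStreamingInserts",
    BigQueryIO.writeTableRows()
        .to("my-project:my_dataset.my_table")   // hypothetical destination table
        .withSchema(tableSchema)
        .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS)
        .withFailedInsertRetryPolicy(InsertRetryPolicy.retryTransientErrors())
        .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
        .withWriteDisposition(WriteDisposition.WRITE_APPEND));
```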

BigQueryIO is preferable because it is part of Beam, so the pipeline understands the records being sent to BigQuery. This means the writes can be monitored, retries are built in, and so on. BigQueryIO.Write also lets you choose between load jobs and streaming inserts via the withMethod setting.
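Putting this together for the GCS-to-BigQuery case in the question, an end-to-end sketch could look like the one below. The bucket path, CSV layout, schema, and table name are all hypothetical; withMethod is where you choose between the two options above. Because input read from GCS is bounded, FILE_LOADS is the default and is usually the cheaper choice.

```java
import java.util.Arrays;

import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.TableRowJsonCoder;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

public class GcsToBigQuery {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Hypothetical two-column schema matching "id,name" CSV lines.
    TableSchema schema = new TableSchema().setFields(Arrays.asList(
        new TableFieldSchema().setName("id").setType("INTEGER"),
        new TableFieldSchema().setName("name").setType("STRING")));

    p.apply("ReadFromGCS", TextIO.read().from("gs://my-bucket/input/*.csv"))
        .apply("ParseCsv", ParDo.of(new DoFn<String, TableRow>() {
          @ProcessElement
          public void processElement(@Element String line, OutputReceiver<TableRow> out) {
            String[] parts = line.split(",", 2);
            out.output(new TableRow()
                .set("id", Long.parseLong(parts[0].trim()))
                .set("name", parts[1].trim()));
          }
        }))
        .setCoder(TableRowJsonCoder.of())
        .apply("WriteToBigQuery",
            BigQueryIO.writeTableRows()
                .to("my-project:my_dataset.my_table")
                .withSchema(schema)
                // Bounded GCS input: load jobs avoid per-row streaming-insert cost;
                // switch to STREAMING_INSERTS only if latency matters more.
                .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
                .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

    p.run().waitUntilFinish();
  }
}
```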

