
Which is better for Dataflow: BigQueryIO.write() or the bigquery.insertAll() method?

I am developing Java code to read records from GCS and insert them into a BigQuery table. From a cost and performance point of view, which is the better choice: BigQueryIO.write() or the bigquery.insertAll() method?

If you are using Dataflow, your preferred method should be Beam's BigQueryIO: this class encapsulates a lot of knowledge about the best ways to handle errors and the different methods of sending data to BigQuery.

The two methods you can choose from with BigQueryIO.Write:

FILE_LOADS:

Use BigQuery load jobs to insert data. Records will first be written to files, and these files will be loaded into BigQuery. This is the default method when the input is bounded. This method can be chosen for unbounded inputs as well, as long as a triggering frequency is also set using BigQueryIO.Write.withTriggeringFrequency. BigQuery has quotas on the number of load jobs allowed per day, so be careful not to set the triggering frequency too high. For more information, see Loading Data from Cloud Storage.
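As an illustration, a minimal sketch of this configuration might look like the following (assuming a Beam Java pipeline where `rows` is a `PCollection<TableRow>` and `tableSchema` is a `TableSchema` you have already built; the project, dataset, and table names are hypothetical):

```java
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
import org.joda.time.Duration;

// Write via BigQuery load jobs: rows are staged as files and loaded in batches.
// withTriggeringFrequency is only needed for unbounded input; keep it low enough
// to stay within BigQuery's daily load-job quota.
rows.apply("WriteWithLoadJobs",
    BigQueryIO.writeTableRows()
        .to("my-project:my_dataset.my_table")   // hypothetical destination table
        .withSchema(tableSchema)
        .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
        .withTriggeringFrequency(Duration.standardMinutes(10))
        .withNumFileShards(10)                  // required when a triggering frequency is set
        .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
        .withWriteDisposition(WriteDisposition.WRITE_APPEND));
```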

STREAMING_INSERTS:

Use the BigQuery streaming insert API to insert data. This provides the lowest-latency insert path into BigQuery, and is therefore the default method when the input is unbounded. BigQuery will make a strong effort to ensure no duplicates when using this path; however, there are some scenarios in which BigQuery is unable to make this guarantee. A query can be run over the output table to periodically clean these rare duplicates. Alternatively, using the FILE_LOADS insert method does guarantee no duplicates, though the latency for the insert into BigQuery will be much higher. For more information, see Streaming Data into BigQuery.
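By contrast, a sketch of the streaming path could look like this (same assumptions about `rows` and `tableSchema`; the retry policy shown is just one reasonable choice):

```java
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.InsertRetryPolicy;

// Write via the streaming insert API: rows become queryable with low latency,
// but streaming inserts are billed per row and rare duplicates are possible.
rows.apply("WriteWithStreamingInserts",
    BigQueryIO.writeTableRows()
        .to("my-project:my_dataset.my_table")   // hypothetical destination table
        .withSchema(tableSchema)
        .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS)
        .withFailedInsertRetryPolicy(InsertRetryPolicy.retryTransientErrors())
        .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
        .withWriteDisposition(WriteDisposition.WRITE_APPEND));
```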

BigQueryIO is preferable because it is part of Beam, so the pipeline understands the records being sent to BigQuery. This means the writes can be monitored, retries are built in, and so on. BigQueryIO.Write also lets you choose between load jobs and streaming inserts via the withMethod setting.
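Putting this together for the GCS-to-BigQuery case in the question, an end-to-end sketch could look like the one below. The bucket path, CSV layout, schema, and table name are all hypothetical; withMethod is where you choose between the two options above. Because input read from GCS is bounded, FILE_LOADS is the default and is usually the cheaper choice.

```java
import java.util.Arrays;

import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.TableRowJsonCoder;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

public class GcsToBigQuery {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Hypothetical two-column schema matching "id,name" CSV lines.
    TableSchema schema = new TableSchema().setFields(Arrays.asList(
        new TableFieldSchema().setName("id").setType("INTEGER"),
        new TableFieldSchema().setName("name").setType("STRING")));

    p.apply("ReadFromGCS", TextIO.read().from("gs://my-bucket/input/*.csv"))
        .apply("ParseCsv", ParDo.of(new DoFn<String, TableRow>() {
          @ProcessElement
          public void processElement(@Element String line, OutputReceiver<TableRow> out) {
            String[] parts = line.split(",", 2);
            out.output(new TableRow()
                .set("id", Long.parseLong(parts[0].trim()))
                .set("name", parts[1].trim()));
          }
        }))
        .setCoder(TableRowJsonCoder.of())
        .apply("WriteToBigQuery",
            BigQueryIO.writeTableRows()
                .to("my-project:my_dataset.my_table")
                .withSchema(schema)
                // Bounded GCS input: load jobs avoid per-row streaming-insert cost;
                // switch to STREAMING_INSERTS only if latency matters more.
                .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
                .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

    p.run().waitUntilFinish();
  }
}
```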

