
Set maximumBillingTier when reading from BigQuery in Dataflow

I'm running a GCP Dataflow job that reads data from BigQuery as a query result. I'm using google-cloud-dataflow-java-sdk-all version 1.9.0. The code fragment that sets up the pipeline looks like this:

PCollection<TableRow> myRows = pipeline.apply(BigQueryIO.Read
            .fromQuery(query)
            .usingStandardSql()
            .withoutResultFlattening()
            .named("Input " + tableId)
    );

The query is quite complex, which results in this error message:

Query exceeded resource limits for tier 1. Tier 8 or higher required.

I'd like to set maximumBillingTier, as can be done in the Web UI or with the bq command-line tool. I can't find any way to do so except setting a default for the entire project, which unfortunately is not an option.

I tried to set it through these, without success:

  • DataflowPipelineOptions - neither this interface nor any interface it extends seems to have that setting
  • BigQueryIO.Read.Bound - I would expect it to be right next to usingStandardSql and other similar options, but it is obviously not there
  • JobConfigurationQuery - this class has all the right settings, but it doesn't seem to be used at all when setting up a pipeline

Is there any way to pass this setting from within a Dataflow job?

Maybe a Googler will correct me, but it looks like you are right. I can't see this parameter exposed either. I checked both the Dataflow and the Beam APIs.

Under the hood, Dataflow uses JobConfigurationQuery from the BigQuery API, but it simply doesn't expose that parameter through its own API.

One workaround I see is to first run your complex query directly through the BigQuery API, before dropping into your pipeline. That way you can set the maximum billing tier through the JobConfigurationQuery class. Write the results of that query to another table in BigQuery.
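A sketch of that first step, using the BigQuery v2 Java API client. The project, dataset, and table names are hypothetical placeholders, and an authenticated `Bigquery` client (`bigquery`) is assumed to be available:

```java
import com.google.api.services.bigquery.Bigquery;
import com.google.api.services.bigquery.model.Job;
import com.google.api.services.bigquery.model.JobConfiguration;
import com.google.api.services.bigquery.model.JobConfigurationQuery;
import com.google.api.services.bigquery.model.TableReference;

// Mirror the pipeline's query settings, but with maximumBillingTier set.
JobConfigurationQuery queryConfig = new JobConfigurationQuery()
        .setQuery(query)
        .setUseLegacySql(false)      // standard SQL, matching usingStandardSql()
        .setFlattenResults(false)    // matching withoutResultFlattening()
        .setMaximumBillingTier(8)    // the setting Dataflow doesn't expose
        .setDestinationTable(new TableReference()
                .setProjectId("my-project")
                .setDatasetId("my_dataset")
                .setTableId("complex_query_results"));

// Submit the query job and wait for it outside the pipeline.
Job job = new Job().setConfiguration(new JobConfiguration().setQuery(queryConfig));
bigquery.jobs().insert("my-project", job).execute();
```

You'd still need to poll the job for completion before starting the pipeline, since jobs().insert() returns as soon as the job is accepted.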

Then finally, in your pipeline, just read in the table which was created from the complex query.
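With the 1.9.0 SDK used in the question, reading that staging table back in would look something like this (the table name below matches the hypothetical destination table above):

```java
PCollection<TableRow> myRows = pipeline.apply(BigQueryIO.Read
        .named("Input " + tableId)
        .from("my-project:my_dataset.complex_query_results"));
```

This replaces fromQuery(query) with a plain table read, so none of the query-time settings (and hence no billing tier) apply at this stage.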
