
Apache Beam on Dataflow Not Accepting ValueProvider for BigQuery Query

Goal

My goal is to create a Dataflow template that specifies an Apache Beam pipeline. The pipeline runs in batch mode, reads from BigQuery, then performs transforms and writes elsewhere. Most importantly, the query used to read from BigQuery has to be provided at runtime.

Expected Behavior

The expected result is that the pipeline uses the runtime parameter as the BigQuery query, executes the query, and then proceeds with the rest of the pipeline.

Actual Behavior

The actual behavior is that the runtime parameter I pass in is ignored; instead, the parameter I had to specify when creating the GCS template is used.

Relevant Code

Below is how I specify the read operation, and how the query parameter is defined and passed in.

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.StreamingOptions;
import org.apache.beam.sdk.options.ValueProvider;
import org.apache.beam.sdk.values.PCollection;

public interface MyOptions extends PipelineOptions, StreamingOptions {
    @Description("Query String")
    ValueProvider<String> getQueryString();

    void setQueryString(ValueProvider<String> value);
}

public static void main(String[] args) {
    MyOptions options = PipelineOptionsFactory.fromArgs(args)
            .withValidation()
            .as(MyOptions.class);
    Pipeline p = Pipeline.create(options);

    // Read from BigQuery using the runtime-provided query string.
    PCollection<TableRow> tableRows =
            p.apply(BigQueryIO.readTableRows()
                    .fromQuery(options.getQueryString())
                    .withTemplateCompatibility()
                    .withoutValidation());
    // At this point I run my transformations and loading
}

To actually build the template and push it to GCS, I run the following:

mvn compile -Pdataflow-runner exec:java -Dexec.mainClass=com.Pipeline "-Dexec.args=--runner=DataflowRunner --queryString='SELECT time,type FROM [my-project:timeseries.my-data] where time between TIMESTAMP(\"2020-02-13T00:00:00Z\") and TIMESTAMP(\"2020-02-15T00:00:00Z\")'"

Finally, I use the Dataflow Web UI to pick the template from GCS and deploy it. At the bottom of the Web UI I specify my runtime parameters, which is where I set queryString to the runtime query I want to use.
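Instead of the Web UI, the same template can also be launched from the command line. A rough equivalent using gcloud (job name, bucket path, and query value are placeholders; the query here is kept comma-free to avoid fighting gcloud's --parameters list parsing):

gcloud dataflow jobs run my-query-job \
  --gcs-location=gs://my-bucket/templates/my-template \
  --parameters='queryString=SELECT time FROM `my-project.timeseries.my-data`'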

Note: when I go to run the template in Dataflow, I specify queryString and I know for a fact that it is being passed in: I rewrote my first transform to print out queryString, and it correctly prints the specified runtime value. The problem is that the query used by the "read from BigQuery" step is still the original one I supplied when I built the template.
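For reference, a minimal sketch of that kind of debug transform (the class name is made up, and it assumes the org.apache.beam.sdk.transforms.DoFn and ParDo imports). ValueProvider.get() may only be called while the pipeline is executing, which is why the resolved runtime value is visible there:

// Hypothetical pass-through DoFn that logs the resolved query string at runtime.
static class LogQueryFn extends DoFn<TableRow, TableRow> {
    private final ValueProvider<String> query;

    LogQueryFn(ValueProvider<String> query) {
        this.query = query;
    }

    @ProcessElement
    public void processElement(ProcessContext c) {
        // get() resolves the runtime-provided value here, at execution time.
        System.out.println("queryString = " + query.get());
        c.output(c.element());
    }
}

// Applied right after the read, e.g.:
// tableRows.apply(ParDo.of(new LogQueryFn(options.getQueryString())));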

After many iterations, I figured out the problem. There were actually two issues, the larger one being that I did not need to pass the runtime parameter into the "build template" step at all.

  1. Do not pass the runtime parameter when building the template. It seems obvious in hindsight, but drop it from the mvn compile args (a corrected build command is sketched below).
  2. Formatting queryString as a runtime parameter was difficult. After many iterations, the following worked for me:
SELECT time,type FROM `my-project.timeseries.my-data` where time between TIMESTAMP(\"2019-02-13T00:00:00Z\") and TIMESTAMP(\"2020-02-15T00:00:00Z\")

Note the lack of quotes around the entire parameter, and how projectId.dataset.tableId is formatted (backticks and dots rather than the [project:dataset.table] legacy form used in the build-time query).
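Putting the two fixes together, the template-build command no longer carries a query at all. A sketch of what it looks like, assuming the usual Dataflow staging and template-location flags (bucket paths are placeholders):

mvn compile -Pdataflow-runner exec:java -Dexec.mainClass=com.Pipeline "-Dexec.args=--runner=DataflowRunner --project=my-project --stagingLocation=gs://my-bucket/staging --templateLocation=gs://my-bucket/templates/my-template"

The queryString is then supplied only when the template is launched, either in the Web UI or via gcloud as shown earlier.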
