
Error when providing arguments during Dataflow Job creation on GCP Console

Since October 5–6, 2021, my GCP Dataflow template has been using the argument values provided during template creation (i.e., when I run the .py file on my local machine to create the template file in GCP Storage) instead of the arguments provided during job creation from that same template. If I don't provide any values during template creation, the arguments remain RuntimeValueProvider objects (when no defaults are set for the args), and the values provided during job creation are still not picked up.

The arguments provided during job creation are stored in the Dataflow job session: if I open the job and expand "Pipeline options" in the right-side panel, the correct values provided during job creation are there, but they never reach the code.

I'm running my code from the template in the classic way, via the GCP console:

gcloud dataflow jobs run JOB_NAME --gcs-location gs://LOCATION/TEMPLATE/FILE --region REGION --project PROJ_NAME --worker-machine-type MACHINE_TYPE --parameters PARAM_1=PARAM_1_VALUE,PARAM_2=PARAM_2_VALUE

I'm using SDK 2.32.0, and in the code I use "parser.add_value_provider_argument", not "parser.add_argument". I also tested it with "parser.add_argument" and had no success. With both, my code picks up the argument values from when I ran the .py file.

Example 1

import apache_beam as beam
import apache_beam.io.gcp.gcsfilesystem as gcs
from apache_beam.options.pipeline_options import PipelineOptions
class MyOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_value_provider_argument('--PARAM_1', 
            type=str)
        parser.add_value_provider_argument('--PARAM_2', 
            type=str)
beam_options = PipelineOptions()
args = beam_options.view_as(MyOptions)

# Some business operations with args that always assume the values provided during template creation
options = {'project': PROJECT,
           'runner': 'DataflowRunner',
           'region': REGION,
           'staging_location': 'gs://{}/temp'.format(BUCKET),
           'temp_location': 'gs://{}/temp'.format(BUCKET),
           'template_location': 'gs://{}/template/batch_poc'.format(BUCKET)}
pipeline_options = PipelineOptions.from_dictionary(options)

with beam.Pipeline(options = pipeline_options) as p:
    lines = (p
            | beam...
            )
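The elided "business operations" above are the likely failure point: if the ValueProvider is resolved to a concrete value while the pipeline is being built (i.e., while code.py runs locally), that value gets baked into the template and job-time parameters can never replace it. A minimal, dependency-free sketch of the timing, using a stand-in class rather than Beam's real RuntimeValueProvider:

```python
class RuntimeValueProvider:
    """Stand-in for Beam's RuntimeValueProvider: the real class resolves
    its value from the job's pipeline options at execution time."""
    def __init__(self):
        self._runtime_value = None  # filled in when the job starts

    def set_runtime_value(self, value):
        # Simulates the job submission supplying --parameters values.
        self._runtime_value = value

    def get(self):
        if self._runtime_value is None:
            raise RuntimeError("value only available at pipeline execution")
        return self._runtime_value

param = RuntimeValueProvider()

# Anti-pattern: forcing a value at construction (template-creation) time.
# Whatever exists *now* is frozen into the template graph.
try:
    constructed_with = param.get()
except RuntimeError as e:
    constructed_with = str(e)

# Correct pattern: keep the provider object and call .get() only at runtime.
param.set_runtime_value("another_another_test_1")  # gcloud --parameters ...

print(constructed_with)  # value only available at pipeline execution
print(param.get())       # another_another_test_1
```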

Example 2 (same as Example 1, but with default values)

# ... same as example 1
class MyOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_value_provider_argument('--PARAM_1',
            default="test1",
            type=str)
        parser.add_value_provider_argument('--PARAM_2',
            default="test2",
            type=str)
# ... same as example 1

In all cases, the parameters I provide during job creation are ignored.

Case 1: Running Example 1 with no args on the local machine (first command below), then running its template on GCP both with and without args (second command below). In both runs, PARAM_1 and PARAM_2 hold the same value: RuntimeValueProvider(...).

LOCALHOST> python3 code.py

GCP> gcloud dataflow jobs run ...
OR
GCP> gcloud dataflow jobs run ... --parameters PARAM_1=another_test_1,PARAM_2=another_test_2

Case 2: Running Example 1 with args on the local machine (first command below), then running its template on GCP both with and without args (second command below). PARAM_1 and PARAM_2 hold the values passed during template creation, another_test_{value}, not another_another_test_{value}.

LOCALHOST> python3 code.py --PARAM_1 another_test_1 --PARAM_2 another_test_2

GCP> gcloud dataflow jobs run ...
OR
GCP> gcloud dataflow jobs run ... --parameters PARAM_1=another_another_test_1,PARAM_2=another_another_test_2

Case 3: Running Example 2 with no args on the local machine (first command below), then running its template on GCP both with and without args (second command below). PARAM_1 and PARAM_2 hold the default values.

LOCALHOST> python3 code.py

GCP> gcloud dataflow jobs run ...
OR
GCP> gcloud dataflow jobs run ... --parameters PARAM_1=another_test_1,PARAM_2=another_test_2

Case 4: Running Example 2 with args on the local machine, then running its template on GCP both with and without args. The result is the same as Case 2.

Note: I updated both libraries: apache-beam and apache-beam[gcp].

Note that the "--PARAM_1", "--PARAM_2"... values cannot be used during pipeline construction. As per [1]:

"RuntimeValueProvider is the default ValueProvider type. RuntimeValueProvider allows your pipeline to accept a value that is only available during pipeline execution. The value is not available during pipeline construction, so you can't use the value to change your pipeline's workflow graph."

The documentation shows that calling the .get() method on the ValueProvider parameter lets you retrieve the value at runtime and use it in your functions. Literally:

"To use runtime parameter values in your own functions, update the functions to use ValueProvider parameters."

Here, ValueProvider.get() is called inside the runtime method DoFn.process().
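The shape of that pattern, sketched with minimal stand-ins so it runs without apache_beam installed (in real code, FormatFn would extend beam.DoFn and the provider would come from MyOptions): the DoFn stores the whole ValueProvider at construction time and resolves it with .get() only inside process(), at execution time.

```python
class StaticValueProvider:
    """Stand-in for a Beam ValueProvider that already has its value."""
    def __init__(self, value):
        self._value = value

    def get(self):
        return self._value

class FormatFn:
    """Shaped like a beam.DoFn: keep the provider in __init__ (construction
    time); call .get() only in process() (runtime)."""
    def __init__(self, param):
        self.param = param  # store the ValueProvider itself, do NOT .get() here

    def process(self, element):
        # Resolved at execution time, so job-time parameters are honored.
        yield f"{self.param.get()}-{element}"

fn = FormatFn(StaticValueProvider("another_test_1"))
print(list(fn.process("row")))  # ['another_test_1-row']
```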

Based on this, I suggest you change your code following [2] and retry.
