
Error when providing arguments during Dataflow Job creation on GCP Console

Since October 5/6, 2021, my GCP Dataflow template file has been getting the argument values provided during template creation (when I run the .py file on my local machine to create the template file on GCP Storage), and not the arguments provided during job creation based on that same template file. If I don't provide any value during template creation, the arguments come through as RuntimeValueProvider (when not using default values for the args), but the values provided during job creation never do.

The arguments provided during job creation get stored in the Dataflow job session: if I open the job and go to "Pipeline options" in the right-side bar, the correct values provided during job creation are there, but they never reach the code.

I'm running my code from the template in the classic way via the GCP console:

gcloud dataflow jobs run JOB_NAME --gcs-location gs://LOCATION/TEMPLATE/FILE --region REGION --project PROJ_NAME --worker-machine-type MACHINE_TYPE --parameters PARAM_1=PARAM_1_VALUE,PARAM_2=PARAM_2_VALUE

I'm using SDK 2.32.0, and inside the code I'm using "parser.add_value_provider_argument", not "parser.add_argument". I also tested it with "parser.add_argument" and had no success. With both, my code picks up the argument values from when I ran the .py file.

Example 1

import apache_beam as beam
import apache_beam.io.gcp.gcsfilesystem as gcs
from apache_beam.options.pipeline_options import PipelineOptions
class MyOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_value_provider_argument('--PARAM_1', 
            type=str)
        parser.add_value_provider_argument('--PARAM_2', 
            type=str)
beam_options = PipelineOptions()
args = beam_options.view_as(MyOptions)

# Some business operations with args, which always end up using the values
# provided during template creation.
# (PROJECT, REGION and BUCKET below are placeholders for the real values.)
options = {'project': PROJECT,
           'runner': 'DataflowRunner',
           'region': REGION,
           'staging_location': 'gs://{}/temp'.format(BUCKET),
           'temp_location': 'gs://{}/temp'.format(BUCKET),
           'template_location': 'gs://{}/template/batch_poc'.format(BUCKET)}
pipeline_options = PipelineOptions.from_dictionary(options)

with beam.Pipeline(options=pipeline_options) as p:
    lines = (p
            | beam...
            )

Example 2 (same as example 1 but using default values)

# ... same as example 1
class MyOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_value_provider_argument('--PARAM_1',
            default="test1",
            type=str)
        parser.add_value_provider_argument('--PARAM_2',
            default="test2",
            type=str)
# ... same as example 1

In all cases my parameters provided during job creation are ignored.

Case 1: Running example 1 with no args on the local machine (python command below), then running its template on GCP both with and without args (the two gcloud commands below). In both runs, PARAM_1_VALUE and PARAM_2_VALUE hold the same thing: RuntimeValueProvider(...)

LOCALHOST> python3 code.py

GCP> gcloud dataflow jobs run ...
OR
GCP> gcloud dataflow jobs run ... --parameters PARAM_1=another_test_1,PARAM_2=another_test_2

Case 2: Running example 1 with args on the local machine (python command below), then running its template on GCP both with and without args (the two gcloud commands below). In both runs, PARAM_1_VALUE and PARAM_2_VALUE hold the values passed during template creation, another_test_{value}, and not another_another_test_{value}.

LOCALHOST> python3 code.py --PARAM_1 another_test_1 --PARAM_2 another_test_2

GCP> gcloud dataflow jobs run ...
OR
GCP> gcloud dataflow jobs run ... --parameters PARAM_1=another_another_test_1,PARAM_2=another_another_test_2

Case 3: Running example 2 with no args on the local machine (python command below), then running its template on GCP both with and without args (the two gcloud commands below). In both runs, PARAM_1_VALUE and PARAM_2_VALUE hold the default values.

LOCALHOST> python3 code.py

GCP> gcloud dataflow jobs run ...
OR
GCP> gcloud dataflow jobs run ... --parameters PARAM_1=another_test_1,PARAM_2=another_test_2

Case 4: Running example 2 with args on the local machine, then running its template on GCP both with and without args. Same outcome as case 2.

Note: I updated both libs: apache-beam and apache-beam[gcp]

Note that the "--PARAM_1_VALUE", "--PARAM_2_VALUE"... values cannot be used during pipeline construction. As per [1]:

“RuntimeValueProvider is the default ValueProvider type. RuntimeValueProvider allows your pipeline to accept a value that is only available during pipeline execution. The value is not available during pipeline construction, so you can't use the value to change your pipeline's workflow graph.”
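
A minimal sketch of what this means for the options class from your example 1 (the single PARAM_1 argument and the print are only for illustration): at template-creation time the argument is just a ValueProvider handle, and anything computed from it at that point is frozen into the graph.

from apache_beam.options.pipeline_options import PipelineOptions

class MyOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_value_provider_argument('--PARAM_1', type=str)

args = PipelineOptions().view_as(MyOptions)

# Run with no flags: PARAM_1 is a RuntimeValueProvider handle, and printing it
# gives something like:
#   RuntimeValueProvider(option: PARAM_1, type: str, default_value: None)
# Run with --PARAM_1 another_test_1: it becomes a StaticValueProvider with that
# value frozen in, which is consistent with cases 2 and 4 above.
print(args.PARAM_1)

Any string built from it at this point (e.g. 'gs://{}/x'.format(args.PARAM_1)) therefore captures the construction-time state, not the value passed at job creation.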

The documentation shows that using the .get() method on the ValueProvider parameter lets you retrieve the value at runtime and use it in your functions. Literally:

“To use runtime parameter values in your own functions, update the functions to use ValueProvider parameters.”

Here, ValueProvider.get() is called inside the runtime method DoFn.process().
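
A minimal sketch of that pattern, adapted to the PARAM_1/PARAM_2 options from the question (the TagWithParams DoFn and the Create input are illustrative assumptions, not your actual business logic):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

class MyOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_value_provider_argument('--PARAM_1', type=str)
        parser.add_value_provider_argument('--PARAM_2', type=str)

class TagWithParams(beam.DoFn):
    def __init__(self, param_1, param_2):
        # Store the ValueProvider objects themselves; do not call .get() here,
        # because __init__ runs at graph-construction (template-creation) time.
        self.param_1 = param_1
        self.param_2 = param_2

    def process(self, element):
        # .get() is evaluated at runtime, so the values passed with
        # `gcloud dataflow jobs run ... --parameters PARAM_1=...,PARAM_2=...`
        # are the ones that show up here.
        yield (self.param_1.get(), self.param_2.get(), element)

beam_options = PipelineOptions()
args = beam_options.view_as(MyOptions)

with beam.Pipeline(options=beam_options) as p:
    _ = (p
         | beam.Create(['a', 'b'])
         | beam.ParDo(TagWithParams(args.PARAM_1, args.PARAM_2)))

The key point is that every read of the parameters happens inside process(), never while building the options dictionary or the pipeline graph.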

Based on this, I suggest you change your code following [2] and retry.
