Google Dataflow and Apache Beam: why ValueProvider

Question

I'm pretty new to Dataflow and trying to build a template with Python.

This is the document that confuses me.

Is there any reason that we use a ValueProvider?

I've found many official templates that just use Python's argparse.

When should I use which solution?

1. Create a subclass of PipelineOptions and use ValueProvider:

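from apache_beam.options.pipeline_options import PipelineOptions


class MyOptions(PipelineOptions):  # hypothetical class name, for illustration
    @classmethod
    def _add_argparse_args(cls, parser):
        # Registers --input as a runtime (ValueProvider) argument.
        parser.add_value_provider_argument(
            "--input", dest="input", required=True, help="Input for the pipeline",
        )
        ...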

Or 2. Use argparse to parse arguments in the if __name__ == "__main__" block?

Answer 1

The reason to use ValueProvider through PipelineOptions instead of using argparse directly for arguments is to enable runtime parameters. Understanding how this is useful involves understanding the difference between runtime parameters and construction time parameters. As the Dataflow Templates Overview states:

If you use Dataflow templates, staging and execution are separate steps. This separation gives you additional flexibility to decide who can run jobs and where the jobs are run from.

Regular arguments are evaluated when you create and stage your template. So if you use argparse directly, or your PipelineOptions subclass has arguments registered with add_argument, their values are fixed when you first run your template code to construct and stage the job graph. You generally run this step only once to stage the job, and afterwards you can execute the staged job repeatedly.
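
As a rough illustration of construction-time parsing, here is a minimal sketch using only the standard library. The flag name --output_prefix and the function name parse_construction_args are made up for this example. parse_known_args picks out the flags the script itself understands and returns the rest untouched, so they could be forwarded (for example to PipelineOptions). Everything parsed here is read once, when the script runs to build and stage the template.

```python
import argparse


def parse_construction_args(argv):
    """Parse flags whose values are fixed when the template is built."""
    parser = argparse.ArgumentParser()
    # Hypothetical flag: read once at construction time, so its value is
    # baked into the staged job graph.
    parser.add_argument("--output_prefix", default="out")
    # Flags this parser does not recognize are returned as-is so they can
    # be forwarded, e.g. to PipelineOptions(pipeline_args).
    known_args, pipeline_args = parser.parse_known_args(argv)
    return known_args, pipeline_args


known, rest = parse_construction_args(
    ["--output_prefix", "daily", "--runner=DataflowRunner"]
)
print(known.output_prefix)  # daily
print(rest)  # ['--runner=DataflowRunner']
```

Changing --output_prefix in a setup like this would mean building and staging the template again, because the value is consumed before the job graph exists.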

Whenever you execute a job that's been staged, you can specify additional runtime parameters. Unlike construction-time parameters, which are specified only once when constructing the pipeline, runtime parameters can be specified each time you run your staged job, and can therefore be changed far more often. However, transforms need to explicitly support runtime arguments as ValueProviders for this to be an option.
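
To make the distinction concrete, here is a toy model in plain Python. It is not Beam's implementation: DeferredValue, build_graph, and step are hypothetical names that only mimic the idea of a value that is unknown while the graph is built and read later through get(). In Beam the analogous call is ValueProvider.get(), which is meant to be called inside a transform's processing code rather than while the pipeline is being constructed.

```python
class DeferredValue:
    """Toy stand-in for a runtime parameter (NOT Beam's ValueProvider)."""

    def __init__(self, name):
        self.name = name
        self._value = None
        self._is_set = False

    def set(self, value):
        # In this toy model, the "runner" calls set() at the start of a run.
        self._value = value
        self._is_set = True

    def get(self):
        if not self._is_set:
            raise RuntimeError(f"{self.name}.get() called before runtime")
        return self._value


def build_graph(prefix, suffix):
    """Construction step, run once: prefix is a plain value, suffix is deferred."""

    def step(element):
        # prefix was captured at construction time; suffix.get() is only
        # evaluated here, when the step actually processes an element.
        return prefix + element + suffix.get()

    return step


suffix = DeferredValue("suffix")
step = build_graph("<", suffix)  # "stage" the graph once

suffix.set(">")  # first execution
first = [step(x) for x in ["a", "b"]]

suffix.set("!")  # second execution: same graph, new runtime value
second = [step(x) for x in ["a", "b"]]

print(first)   # ['<a>', '<b>']
print(second)  # ['<a!', '<b!']
```

The step works with a changing suffix only because it defers get() until it processes an element. A step that read the value while the graph was being built would fail (or freeze the value), which is the sense in which a transform has to explicitly support runtime arguments.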

So to summarize, the decision of whether to use a ValueProvider depends on whether you need the argument when constructing your graph or when running your pipeline. Construction-time arguments are mainly for values that affect graph construction, or that are expected to stay the same across multiple runs of a job. Arguments that are likely to change each time you run a job, such as input files, should be runtime arguments (i.e. ValueProviders), assuming the transforms you're using support them.

1 answer

solution1
3 ACCEPTED 2020-07-21 00:29:07
