
worker_machine_type tag not working in Google Cloud Dataflow with python

I am using Apache Beam in Python with Google Cloud Dataflow (2.3.0). When specifying the worker_machine_type parameter as e.g. n1-highmem-2 or custom-1-6656, Dataflow runs the job but always uses the standard machine type n1-standard-1 for every worker.

Does anyone have an idea if I am doing something wrong?

Other topics (here and here) show that this should be possible, so this might be a version issue.

My code for specifying PipelineOptions (note that all other options do work fine, so it should recognize the worker_machine_type parameter):

import os
import sys
from datetime import datetime

import apache_beam as beam

# BUCKET and parse_arguments are defined elsewhere in the project
def get_cloud_pipeline_options(project):
  options = {
    'runner': 'DataflowRunner',
    'job_name': ('converter-ml6-{}'.format(
        datetime.now().strftime('%Y%m%d%H%M%S'))),
    'staging_location': os.path.join(BUCKET, 'staging'),
    'temp_location': os.path.join(BUCKET, 'tmp'),
    'project': project,
    'region': 'europe-west1',
    'zone': 'europe-west1-d',
    'autoscaling_algorithm': 'THROUGHPUT_BASED',
    'save_main_session': True,
    'setup_file': './setup.py',
    'worker_machine_type': 'custom-1-6656',
    'max_num_workers': 3,
  }

  return beam.pipeline.PipelineOptions(flags=[], **options)

def main(argv=None):
  args = parse_arguments(sys.argv if argv is None else argv)

  pipeline_options = get_cloud_pipeline_options(args.project_id)

  pipeline = beam.Pipeline(options=pipeline_options)

This can be solved by using the flag machine_type instead of worker_machine_type. The rest of the code works fine.

The documentation thus mentions the wrong field name.
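A trimmed sketch of the corrected options from the question (only the renamed key changes; the other entries stay as before):

import apache_beam as beam

def get_cloud_pipeline_options(project):
  options = {
    'runner': 'DataflowRunner',
    'project': project,
    # 'machine_type' is the argument name PipelineOptions(**kwargs)
    # recognizes; 'worker_machine_type' is only the flag alias
    'machine_type': 'custom-1-6656',
    'max_num_workers': 3,
  }
  return beam.pipeline.PipelineOptions(flags=[], **options)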

PipelineOptions uses argparse behind the scenes to parse its arguments. In the case of the machine type, the name of the argument is machine_type, whereas the flag name is worker_machine_type. This works fine in the following two cases, where argparse does the parsing and is aware of this aliasing:

  1. Passing arguments on the command line, e.g. my_pipeline.py --worker_machine_type custom-1-6656
  2. Passing arguments as command-line flags, e.g. flags=['--worker_machine_type', 'custom-1-6656', ...] (see the sketch after this list)
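A minimal sketch of case 2; because the flag list goes through argparse, the alias is resolved and the value ends up under the argument name machine_type (get_all_options() returns the parsed options as a dict):

import apache_beam as beam

# argparse parses the flag and stores the value under its dest name
opts = beam.pipeline.PipelineOptions(
    flags=['--worker_machine_type', 'custom-1-6656'])

print(opts.get_all_options()['machine_type'])  # custom-1-6656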

However, it does not work well with **kwargs. Any additional args passed that way are used to substitute for known argument names (but not flag names).
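A sketch contrasting the two keyword spellings, assuming the behavior described above (the exact handling of the unknown key may vary across Beam versions):

import apache_beam as beam

# Known argument name: the override is applied.
good = beam.pipeline.PipelineOptions(flags=[], machine_type='custom-1-6656')
print(good.get_all_options()['machine_type'])  # custom-1-6656

# Flag alias as a kwarg: argparse never sees it, so the override is
# dropped and workers fall back to the default n1-standard-1.
bad = beam.pipeline.PipelineOptions(flags=[], worker_machine_type='custom-1-6656')
print(bad.get_all_options()['machine_type'])  # None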

In short, using machine_type would work everywhere. I filed https://issues.apache.org/jira/browse/BEAM-4112 for this to be fixed in Beam in the future.

What worked for me on Apache Beam 2.8.0 was updating the line in the source code that defines --worker_machine_type, changing it to --machine_type (and then using machine_type as the parameter name, as the other answers suggest).
