worker_machine_type标记在使用python的Google Cloud Dataflow中无效

Question

我在Python中使用Apache Beam和Google Cloud Dataflow（2.3.0）。 将worker_machine_type参数指定为例如n1-highmem-2 worker_machine_type n1-highmem-2或custom-1-6656 ，Dataflow会运行该作业，但始终为每个工作程序使用标准机器类型n1-standard-1 。

如果我做错了，有没有人知道？

其他主题（此处和此处）表明这应该是可能的，因此这可能是版本问题。

我的用于指定PipelineOptions的代码（请注意，所有其他选项都可以正常工作，因此它应该识别worker_machine_type参数）：

def get_cloud_pipeline_options(project):

  options = {
    'runner': 'DataflowRunner',
    'job_name': ('converter-ml6-{}'.format(
        datetime.now().strftime('%Y%m%d%H%M%S'))),
    'staging_location': os.path.join(BUCKET, 'staging'),
    'temp_location': os.path.join(BUCKET, 'tmp'),
    'project': project,
    'region': 'europe-west1',
    'zone': 'europe-west1-d',
    'autoscaling_algorithm': 'THROUGHPUT_BASED',
    'save_main_session': True,
    'setup_file': './setup.py',
    'worker_machine_type': 'custom-1-6656',
    'max_num_workers': 3,
  }

  return beam.pipeline.PipelineOptions(flags=[], **options)

def main(argv=None):
  args = parse_arguments(sys.argv if argv is None else argv)

  pipeline_options = get_cloud_pipeline_options(args.project_id

  pipeline = beam.Pipeline(options=pipeline_options)

Answer 1

这可以通过使用flag machine_type而不是worker_machine_type来解决。 其余的代码工作正常。

因此，文档提到了错误的字段名称。

Answer 2

PipelineOptions在幕后使用argparse来解析其参数。 在机器类型的情况下，参数的名称是machine_type但是标志名称是worker_machine_type 。 这在以下两种情况下工作正常，其中argparse进行解析并知道这种别名：

在命令行上传递参数。 例如my_pipeline.py --worker_machine_type custom-1-6656
将参数作为命令行标记，例如flags['--worker_machine_type', 'worker_machine_type custom-1-6656', ...]

然而它与**kwargs不兼容。 以这种方式传递的任何其他args用于替换已知的参数名称（但不是标志名称）。

简而言之，使用machine_type可以在任何地方使用。 我提交了https://issues.apache.org/jira/browse/BEAM-4112 ，以便将来在Beam中修复。

Answer 3

什么在Apache的梁2.8.0工作对我来说是更新该行通过更改源代码的--worker_machine_type到--machine_type （然后使用machine_type作为参数的名称，如其他答案建议）。

worker_machine_type标记在使用python的Google Cloud Dataflow中无效

问题描述

3 个解决方案

解决方案1
2 2018-04-13 06:57:50

解决方案2
2 已采纳 2018-04-18 01:26:51

解决方案3
0 2018-11-28 00:02:07

worker_machine_type标记在使用python的Google Cloud Dataflow中无效

问题描述

3 个解决方案

解决方案1 2 2018-04-13 06:57:50

解决方案2 2 已采纳 2018-04-18 01:26:51

解决方案3 0 2018-11-28 00:02:07

解决方案1
2 2018-04-13 06:57:50

解决方案2
2 已采纳 2018-04-18 01:26:51

解决方案3
0 2018-11-28 00:02:07