
worker_machine_type tag not working in Google Cloud Dataflow with python

I am using Apache Beam in Python with Google Cloud Dataflow (2.3.0). When specifying the worker_machine_type parameter as e.g. n1-highmem-2 or custom-1-6656, Dataflow runs the job but always uses the standard machine type n1-standard-1 for every worker.

Does anyone have an idea if I am doing something wrong?

Other topics (here and here) show that this should be possible, so this might be a version issue.

My code for specifying PipelineOptions (note that all other options do work fine, so it should recognize the worker_machine_type parameter):

import os
import sys
from datetime import datetime

import apache_beam as beam

# BUCKET and parse_arguments are defined elsewhere in the project
def get_cloud_pipeline_options(project):
  options = {
    'runner': 'DataflowRunner',
    'job_name': ('converter-ml6-{}'.format(
        datetime.now().strftime('%Y%m%d%H%M%S'))),
    'staging_location': os.path.join(BUCKET, 'staging'),
    'temp_location': os.path.join(BUCKET, 'tmp'),
    'project': project,
    'region': 'europe-west1',
    'zone': 'europe-west1-d',
    'autoscaling_algorithm': 'THROUGHPUT_BASED',
    'save_main_session': True,
    'setup_file': './setup.py',
    'worker_machine_type': 'custom-1-6656',
    'max_num_workers': 3,
  }

  return beam.pipeline.PipelineOptions(flags=[], **options)

def main(argv=None):
  args = parse_arguments(sys.argv if argv is None else argv)

  pipeline_options = get_cloud_pipeline_options(args.project_id)

  pipeline = beam.Pipeline(options=pipeline_options)

This can be solved by using the flag machine_type instead of worker_machine_type. The rest of the code works fine.

The documentation thus mentions the wrong field name.
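A trimmed sketch of the corrected options from the question (only the renamed key changes; the other entries stay as before):

import apache_beam as beam

def get_cloud_pipeline_options(project):
  options = {
    'runner': 'DataflowRunner',
    'project': project,
    # 'machine_type' is the argument name PipelineOptions(**kwargs)
    # recognizes; 'worker_machine_type' is only the flag alias
    'machine_type': 'custom-1-6656',
    'max_num_workers': 3,
  }
  return beam.pipeline.PipelineOptions(flags=[], **options)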

PipelineOptions uses argparse behind the scenes to parse its arguments. In the case of the machine type, the name of the argument is machine_type, whereas the flag name is worker_machine_type. This works fine in the following two cases, where argparse does the parsing and is aware of this aliasing:

  1. Passing arguments on the command line, e.g. my_pipeline.py --worker_machine_type custom-1-6656
  2. Passing arguments as command-line flags, e.g. flags=['--worker_machine_type', 'custom-1-6656', ...] (see the sketch after this list)
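A minimal sketch of case 2; because the flag list goes through argparse, the alias is resolved and the value ends up under the argument name machine_type (get_all_options() returns the parsed options as a dict):

import apache_beam as beam

# argparse parses the flag and stores the value under its dest name
opts = beam.pipeline.PipelineOptions(
    flags=['--worker_machine_type', 'custom-1-6656'])

print(opts.get_all_options()['machine_type'])  # custom-1-6656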

However, it does not work well with **kwargs. Any additional args passed that way are used to substitute for known argument names (but not flag names).
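A sketch contrasting the two keyword spellings, assuming the behavior described above (the exact handling of the unknown key may vary across Beam versions):

import apache_beam as beam

# Known argument name: the override is applied.
good = beam.pipeline.PipelineOptions(flags=[], machine_type='custom-1-6656')
print(good.get_all_options()['machine_type'])  # custom-1-6656

# Flag alias as a kwarg: argparse never sees it, so the override is
# dropped and workers fall back to the default n1-standard-1.
bad = beam.pipeline.PipelineOptions(flags=[], worker_machine_type='custom-1-6656')
print(bad.get_all_options()['machine_type'])  # None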

In short, using machine_type would work everywhere. I filed https://issues.apache.org/jira/browse/BEAM-4112 for this to be fixed in Beam in the future.

What worked for me on Apache Beam 2.8.0 was updating the line in the source code that defines --worker_machine_type, changing it to --machine_type (and then using machine_type as the parameter name, as the other answers suggest).
