Specifying machine type for a GCP Dataflow job in Python

I have a Dataflow template generated by Dataprep and am executing it using Composer (i.e. Apache Airflow).

The task is triggering the Dataflow job, but it then fails with an error which, according to posts on SO, indicates that I need to specify a machine type with higher memory.

I'm specifying machineType in the DataflowTemplateOperator, but it's not being applied to the Dataflow job:

dataflow_default_options={
    'project': 'projectname',
    'zone': 'europe-west1-b',
    'tempLocation': 'gs://bucketname-dataprep-working/temp/',
    'machineType': 'n1-highmem-4'
},

Having investigated this for some time, I've seen conflicting advice as to what to call the machineType attribute. I've also tried workerMachineType, machine-type and worker-machine-type, to no avail.

Has anyone here successfully specified a worker type for DataflowTemplateOperator?

I'm assuming you're using the Python SDK based on the tag. Have you tried the Python options from the execution parameter documentation? The Python option is spelled machine_type, which is an alias for worker_machine_type (with underscores).

I've not used Composer/Airflow before, so this is just a suggestion.
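
For reference, here is a minimal sketch of how that option is passed when a pipeline is run directly with the Beam Python SDK (rather than through a template launch); the project and bucket values below are just the placeholders from the question:

from apache_beam.options.pipeline_options import PipelineOptions

# Minimal sketch: setting the worker machine type for the Dataflow runner.
# machine_type is the underscore-spelled alias for worker_machine_type.
options = PipelineOptions(
    runner='DataflowRunner',
    project='projectname',                                   # placeholder from the question
    region='europe-west1',
    temp_location='gs://bucketname-dataprep-working/temp/',
    machine_type='n1-highmem-4',
)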

As per the hook source, machineType is the only accepted key for template jobs. The variables you specify are then used to build a request to the REST API, like so:

# RuntimeEnvironment
environment = {}
for key in ['maxWorkers', 'zone', 'serviceAccountEmail', 'tempLocation',
            'bypassTempDirValidation', 'machineType', 'network', 'subnetwork']:
    if key in variables:
        environment.update({key: variables[key]})

# LaunchTemplateParameters
body = {"jobName": name,
        "parameters": parameters,
        "environment": environment}

# projects.locations.template.launch
service = self.get_conn()
request = service.projects().locations().templates().launch(
    projectId=variables['project'],
    location=variables['region'],
    gcsPath=dataflow_template,
    body=body
)

The documentation for projects.locations.templates.launch specifies that the request body should be an instance of LaunchTemplateParameters, which in turn contains a RuntimeEnvironment. This looks to be accurate from the hook source.
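
To illustrate, the launch request body built by the hook would end up looking roughly like this (a sketch assembled from the snippet above, using the values from the question; the job name and empty parameters are assumptions):

# Rough shape of the LaunchTemplateParameters body sent by the hook.
body = {
    "jobName": "dataprep-template-job",   # assumed job name
    "parameters": {},                     # template parameters, if any
    "environment": {                      # RuntimeEnvironment
        "zone": "europe-west1-b",
        "tempLocation": "gs://bucketname-dataprep-working/temp/",
        "machineType": "n1-highmem-4",
    },
}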

Some debugging steps you could take: log/inspect the outgoing REST call, or find the call in Stackdriver Logging (and therefore the metadata related to the job creation request).
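
For example, one way to pull those entries programmatically is with the Cloud Logging client library (a sketch; the filter is only a starting point and the project ID is the placeholder from the question):

from google.cloud import logging as gcp_logging

# Sketch: list recent Dataflow-related log entries to locate the job
# creation request; the filter and project are assumptions to adjust.
client = gcp_logging.Client(project='projectname')
entries = client.list_entries(
    filter_='resource.type="dataflow_step"',
    page_size=10,
)
for entry in entries:
    print(entry.timestamp, entry.log_name)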

Note: this is only available since [AIRFLOW-1954], which was part of the Airflow v1.10.0 release. This means it is only present in certain Cloud Composer versions.
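
Putting that together, a sketch of the operator configuration the hook expects (the task_id, template path and parameters are placeholders; 'region' is included because the hook reads variables['region'] when building the launch call, and this assumes Airflow >= 1.10.0 per the note above):

from airflow.contrib.operators.dataflow_operator import DataflowTemplateOperator

# Sketch: launching the Dataprep-generated template with machineType set.
start_dataflow = DataflowTemplateOperator(
    task_id='run_dataprep_template',                                     # placeholder
    template='gs://bucketname-dataprep-working/templates/template-id',  # placeholder path
    parameters={},                                                       # template parameters, if any
    dataflow_default_options={
        'project': 'projectname',
        'region': 'europe-west1',
        'zone': 'europe-west1-b',
        'tempLocation': 'gs://bucketname-dataprep-working/temp/',
        'machineType': 'n1-highmem-4',
    },
    dag=dag,  # assumes an existing DAG object
)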
