
Specifying machine type for a GCP Dataflow job in Python

I have a Dataflow template generated by Dataprep and am executing it using Composer (i.e. Apache Airflow).

The task triggers the Dataflow job, but the job then fails with an error that, according to posts on SO, indicates I need to specify a machine type with more memory.

I'm specifying machineType in the DataflowTemplateOperator, but it's not being applied to the Dataflow job:

dataflow_default_options={
    'project': 'projectname',
    'zone': 'europe-west1-b',
    'tempLocation': 'gs://bucketname-dataprep-working/temp/',
    'machineType': 'n1-highmem-4'
},

Having investigated this for some time, I've seen conflicting advice as to what the machine type attribute should be called - I've also tried workerMachineType, machine-type and worker-machine-type, to no avail.

Has anyone here successfully specified a worker type for DataflowTemplateOperator?

I'm assuming you're using the Python SDK based on the tag. Have you tried the Python options from the execution parameter documentation? The Python option is spelled machine_type, which is an alias for worker_machine_type (with underscores).

I've not used Composer/Airflow before, so this is just a suggestion.

As per the hook source, machineType is the only accepted key for template jobs. The variables you specify are then used to build a request to the REST API, like so:

# RuntimeEnvironment
environment = {}
for key in ['maxWorkers', 'zone', 'serviceAccountEmail', 'tempLocation',
            'bypassTempDirValidation', 'machineType', 'network', 'subnetwork']:
    if key in variables:
        environment.update({key: variables[key]})

# LaunchTemplateParameters
body = {"jobName": name,
        "parameters": parameters,
        "environment": environment}

# projects.locations.templates.launch
service = self.get_conn()
request = service.projects().locations().templates().launch(
    projectId=variables['project'],
    location=variables['region'],
    gcsPath=dataflow_template,
    body=body
)
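To make the filtering behaviour concrete, here is a minimal standalone sketch (not the actual hook code, just a reproduction of the filter loop above) showing why only the camelCase machineType spelling survives into the request, while other spellings are silently dropped:

```python
# Keys accepted into RuntimeEnvironment, copied from the hook snippet above.
ACCEPTED_KEYS = ['maxWorkers', 'zone', 'serviceAccountEmail', 'tempLocation',
                 'bypassTempDirValidation', 'machineType', 'network', 'subnetwork']

def build_environment(variables):
    """Mimic the hook: keep only recognised RuntimeEnvironment keys."""
    return {key: variables[key] for key in ACCEPTED_KEYS if key in variables}

options = {
    'project': 'projectname',             # used for the request URL, not the environment
    'zone': 'europe-west1-b',
    'tempLocation': 'gs://bucketname-dataprep-working/temp/',
    'machineType': 'n1-highmem-4',        # accepted spelling
    'workerMachineType': 'n1-highmem-4',  # silently dropped by the filter
}

print(build_environment(options))
```

Any key not in the accepted list (including workerMachineType and the hyphenated variants from the question) simply never reaches the API, with no warning logged.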

The documentation for projects.locations.templates.launch specifies that the request body should be an instance of LaunchTemplateParameters, which in turn contains a RuntimeEnvironment. This matches what the hook source does.

Some debugging steps you could take: log or inspect the outgoing REST call, or find the call in Stackdriver Logging (which will include metadata about the job creation request).
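For the first suggestion, one way to surface the outgoing call is to raise the logging verbosity in your DAG file. This is a hedged sketch: it assumes the hook goes through googleapiclient, whose discovery layer logs request URLs through the standard logging module at DEBUG level.

```python
import logging

# Assumption: the Dataflow hook uses googleapiclient, which logs the URLs of
# outgoing API requests via the 'googleapiclient.discovery' logger at DEBUG.
logging.getLogger('googleapiclient.discovery').setLevel(logging.DEBUG)

# Root logger must also be permissive enough for the messages to appear
# in the Airflow task log.
logging.basicConfig(level=logging.DEBUG)
```

With this in place, the task log should show the templates.launch request, which you can check against the RuntimeEnvironment keys you expect.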

Note: This is only available since [AIRFLOW-1954], which was part of the Airflow v1.10.0 release, so it is only present in certain Cloud Composer versions.
