
Schedule a Google Cloud Dataflow job in Python

Currently, these are the options I know of to schedule the execution of a Dataflow job:

  • Using App Engine Cron Service or Cloud Functions.

    • This example is in Java. Is there an official example in Python that is as simple?
    • This example is in Python, but I'm not sure whether it is currently still a good option or is "deprecated".
  • From a cron job in a Compute Engine instance.

    • Is there any tutorial for this?
  • Using windowing in a streaming pipeline.

    • I think this is the easiest, but is it the best option in terms of total cost?
  • Cloud Scheduler

    • Is this a valid method?

I use App Engine Flex as my Dataflow launcher. This microservice has endpoints to launch Dataflow jobs on demand, which cron can hit too.
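For the scheduled part, a hypothetical App Engine cron.yaml could look roughly like this (a sketch, not from the original answer; the path and schedule are assumptions, and since App Engine cron issues GET requests, the POST-only route shown below would need a GET variant, or you can use Cloud Scheduler, which can POST a JSON body):

cron:
- description: "launch the datastore-to-csv Dataflow job daily"
  url: /datastore-to-csv        # hypothetical path; must match a route in main.py
  schedule: every day 03:00     # App Engine cron schedule syntax
  timezone: America/New_York    # assumption; omitting this defaults to UTC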

This is my project structure:

df_tasks/
- __init__.py
- datastore_to_csv.py
- ...other_pipelines
__init__.py
dflaunch.yaml
main.py
setup.py <-- used by pipelines

The trick with this for me was getting my pipeline dependencies set up correctly. Namely, using a setup.py for pipeline dependencies. Setting it up like this example helped out the most: https://github.com/apache/beam/tree/master/sdks/python/apache_beam/examples/complete/juliaset

setup.py:

import setuptools

setuptools.setup(
    name='dataflow_python_pipeline',
    version='1.0.0',
    description='DataFlow Python Pipeline',
    packages=setuptools.find_packages(),
)

My pipeline config in df_tasks then looks like this:

from apache_beam.options.pipeline_options import PipelineOptions

pipeline_options = PipelineOptions.from_dictionary({
    'project': project,
    'runner': 'DataflowRunner',
    'staging_location': bucket_path + '/staging',
    'temp_location': bucket_path + '/temp',
    'setup_file': './setup.py'
})
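For context, the run() function in df_tasks/datastore_to_csv.py might wire these options into a pipeline roughly like this (a minimal sketch based on how main.py calls it below; the transforms, bucket path, and job naming are placeholders, not the original pipeline):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def run(project, model, for_date, bucket_path='gs://my-bucket'):  # bucket is a placeholder
    job_name = 'datastore-to-csv-' + for_date.strftime('%Y%m%d')
    pipeline_options = PipelineOptions.from_dictionary({
        'project': project,
        'runner': 'DataflowRunner',
        'job_name': job_name,
        'staging_location': bucket_path + '/staging',
        'temp_location': bucket_path + '/temp',
        'setup_file': './setup.py',
    })
    p = beam.Pipeline(options=pipeline_options)
    # Placeholder transforms standing in for the real Datastore-to-CSV logic.
    (p
     | 'Seed' >> beam.Create([model])
     | 'Format' >> beam.Map(lambda m: '%s,%s' % (m, for_date.date()))
     | 'Write' >> beam.io.WriteToText(bucket_path + '/output/' + job_name,
                                      file_name_suffix='.csv'))
    result = p.run()  # submits the job to Dataflow and returns without blocking
    # DataflowPipelineResult.job_id() is the id of the remote Dataflow job.
    return result.job_id(), job_name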

Then in main.py:

import datetime
import os
import traceback

from flask import Flask, jsonify, request

from df_tasks import datastore_to_csv

# Flask app serving the launcher endpoints
app = Flask(__name__)

project_id = os.environ['GCLOUD_PROJECT']

@app.route('/datastore-to-csv', methods=['POST'])
def df_day_summary():
    # Extract payload
    try:
        payload = request.get_json()
        model = payload['model']
        for_date = datetime.datetime.strptime(payload['for_date'], '%Y/%m/%d')
    except Exception:
        print(traceback.format_exc())
        return traceback.format_exc()
    # Launch the job
    try:
        job_id, job_name = datastore_to_csv.run(
            project=project_id,
            model=model,
            for_date=for_date,
        )
        # Return the job id
        return jsonify({'jobId': job_id, 'jobName': job_name})
    except Exception:
        print(traceback.format_exc())
        return traceback.format_exc()

There are multiple ways, but one that I think would be very convenient for you is using the DataflowPythonOperator of Apache Airflow.

GCP offers a managed service for Apache Airflow in the form of Cloud Composer, which you can use to schedule Dataflow pipelines or other GCP operations.
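As an illustration, a DAG using that operator might look roughly like this (a minimal sketch assuming the Airflow 1.10 contrib package, where the class is spelled DataFlowPythonOperator; the project, bucket paths, schedule, and pipeline file location are placeholders. In newer Airflow releases the equivalent operator lives in the Google provider as DataflowCreatePythonJobOperator):

from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.dataflow_operator import DataFlowPythonOperator

default_args = {
    'start_date': datetime(2020, 1, 1),  # placeholder start date
}

with DAG('datastore_to_csv_daily',
         default_args=default_args,
         schedule_interval='0 3 * * *',  # every day at 03:00
         catchup=False) as dag:

    launch_dataflow = DataFlowPythonOperator(
        task_id='launch_datastore_to_csv',
        # Hypothetical GCS location of the pipeline file; a local path works too.
        py_file='gs://my-bucket/df_tasks/datastore_to_csv.py',
        options={'setup_file': './setup.py'},
        dataflow_default_options={
            'project': 'my-gcp-project',  # placeholder project id
            'staging_location': 'gs://my-bucket/staging',
            'temp_location': 'gs://my-bucket/temp',
        },
    )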
