Schedule a Google Cloud Dataflow job in Python
Currently these are the options I know of to schedule the execution of a Dataflow job:
Using App Engine Cron Service or Cloud Functions.
From a cron job on a Compute Engine instance
Using windowing in a streaming pipeline
I use App Engine Flex as my Dataflow launcher. This microservice has endpoints to launch dataflow jobs on demand, which cron can hit too.
This is my project structure:
df_tasks/
- __init__.py
- datastore_to_csv.py
- ...other_pipelines
__init__.py
dflaunch.yaml
main.py
setup.py <-- used by pipelines
The trick with this for me was getting my pipeline dependencies set up correctly, namely, using a setup.py for pipeline dependencies. Setting it up like this example helped out the most: https://github.com/apache/beam/tree/master/sdks/python/apache_beam/examples/complete/juliaset
setup.py:
import setuptools
setuptools.setup(
name='dataflow_python_pipeline',
version='1.0.0',
description='DataFlow Python Pipeline',
packages=setuptools.find_packages(),
)
My pipeline configs in df_tasks then look like this:
from apache_beam.options.pipeline_options import PipelineOptions

pipeline_options = PipelineOptions.from_dictionary({
    'project': project,
    'runner': 'DataflowRunner',
    'staging_location': bucket_path + '/staging',
    'temp_location': bucket_path + '/temp',
    'setup_file': './setup.py'
})
Then in main.py:
import os
import datetime
import traceback

from flask import Flask, request, jsonify

from df_tasks import datastore_to_csv

app = Flask(__name__)
project_id = os.environ['GCLOUD_PROJECT']

@app.route('/datastore-to-csv', methods=['POST'])
def df_day_summary():
    # Extract the payload
    try:
        payload = request.get_json()
        model = payload['model']
        for_date = datetime.datetime.strptime(payload['for_date'], '%Y/%m/%d')
    except Exception:
        print(traceback.format_exc())
        return traceback.format_exc()
    # Launch the job
    try:
        job_id, job_name = datastore_to_csv.run(
            project=project_id,
            model=model,
            for_date=for_date,
        )
        # Return the job id
        return jsonify({'jobId': job_id, 'jobName': job_name})
    except Exception:
        print(traceback.format_exc())
        return traceback.format_exc()
There are multiple ways, but one that I think would be very convenient for you would be using the DataflowPythonOperator of Apache Airflow.
GCP offers a managed service for Apache Airflow in the form of Cloud Composer, which you can use to schedule Dataflow pipelines, or other GCP operations.
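A minimal DAG sketch for this approach is below. All names here (DAG id, file paths, bucket, project id, schedule) are assumptions, and the operator's import path varies by Airflow version; this uses the Airflow 1.x contrib location that Cloud Composer shipped with at the time.

```python
import datetime

from airflow import DAG
from airflow.contrib.operators.dataflow_operator import DataFlowPythonOperator

default_args = {
    'start_date': datetime.datetime(2023, 1, 1),  # assumed start date
}

with DAG('datastore_to_csv_daily',            # hypothetical DAG id
         default_args=default_args,
         schedule_interval='@daily',
         catchup=False) as dag:
    launch_pipeline = DataFlowPythonOperator(
        task_id='run_datastore_to_csv',
        # Path where Composer syncs DAG-folder files; adjust to your setup.
        py_file='/home/airflow/gcs/dags/df_tasks/datastore_to_csv.py',
        dataflow_default_options={
            'project': 'my-gcp-project',              # assumed project id
            'staging_location': 'gs://my-bucket/staging',
            'temp_location': 'gs://my-bucket/temp',
        },
        options={'setup_file': '/home/airflow/gcs/dags/setup.py'},
    )
```

Airflow then handles the scheduling and retries, so the pipeline file itself only needs to define and run the Beam pipeline.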