
Create Dataflow job dynamically

I am new to Google Cloud and to Dataflow. What I want to do is create a tool which detects and corrects errors in (large) CSV files. However, not every error that should be dealt with is known at design time. Therefore, I need an easy way to add new functions, each handling a specific kind of error.

The tool should be something like a framework for automatically creating a Dataflow template based on a user selection. I have already thought of a workflow which could work, but as mentioned before I am totally new to this, so please feel free to suggest a better solution:

  1. The user selects in the frontend which error correction methods should be used
  2. A YAML file is created which specifies the selected transformations
  3. A Python script parses the YAML file and uses the error handling functions to build a Dataflow job that executes them as specified in the YAML file (see the sketch after this list)
  4. The Dataflow job is stored as a template and run via a REST API call for a file stored on GCP
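Roughly, what I imagine for step 3 is something like the sketch below, using the Apache Beam Python SDK. The YAML layout, the correction functions (trim_whitespace, drop_empty_rows) and the bucket paths are only made-up examples of what the frontend might produce:

    import yaml
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # Made-up error handling functions; in the real tool these would come
    # from the plugin folder described further down.
    def trim_whitespace(line):
        return ",".join(field.strip() for field in line.split(","))

    def drop_empty_rows(line):
        return None if not line.strip(", ") else line

    CORRECTIONS = {
        "trim_whitespace": trim_whitespace,
        "drop_empty_rows": drop_empty_rows,
    }

    # Stand-in for the YAML file produced by the frontend in step 2.
    CONFIG = """
    input: gs://my-bucket/input/*.csv
    output: gs://my-bucket/output/cleaned
    transformations:
      - trim_whitespace
      - drop_empty_rows
    """

    def build_pipeline(config_text, options=None):
        config = yaml.safe_load(config_text)
        pipeline = beam.Pipeline(options=options or PipelineOptions())
        lines = pipeline | "Read" >> beam.io.ReadFromText(config["input"])
        # Chain the selected corrections in the order given in the YAML file.
        for name in config["transformations"]:
            lines = lines | name >> beam.Map(CORRECTIONS[name])
        lines = lines | "DropBad" >> beam.Filter(lambda line: line is not None)
        lines | "Write" >> beam.io.WriteToText(config["output"])
        return pipeline

    if __name__ == "__main__":
        build_pipeline(CONFIG).run()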

To achieve this extensibility, it should be easy to add new functions that implement the error corrections. What I thought of was:

  1. A developer writes the required function and uploads it to a specified folder
  2. The new function is manually added to the frontend (or to a database, etc.) and can be selected to check for/deal with an error
  3. The user can now select the newly added error handling function, and the Dataflow template that is being created uses this function without any need to edit the code that builds the template

However, my problem is that I am not sure whether this is possible or a "good" solution to the problem. Furthermore, I don't know how to write a Python script that uses functions which are not known at design time. (I thought of using something like the strategy pattern, but as far as I know the functions still need to be implemented at design time, even though the decision which function to use is made at run time.) Any help would be greatly appreciated!
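The best idea I have so far is a small plugin registry along these lines, although I am not sure whether this is a sensible approach for building Dataflow templates (all names here are made up):

    import importlib
    import pkgutil

    # Maps the name shown in the frontend to the actual callable.
    REGISTRY = {}

    def correction(name):
        """Decorator with which a plugin module registers a new error handler."""
        def wrapper(func):
            REGISTRY[name] = func
            return func
        return wrapper

    def load_plugins(package_name="corrections"):
        """Import every module in the plugin package so their decorators run."""
        package = importlib.import_module(package_name)
        for module_info in pkgutil.iter_modules(package.__path__):
            importlib.import_module(f"{package_name}.{module_info.name}")

    # A plugin module such as corrections/fix_dates.py would only contain:
    #
    #     @correction("fix_dates")
    #     def fix_dates(line):
    #         ...  # repair malformed dates in the CSV line
    #         return line
    #
    # After load_plugins() has run, REGISTRY["fix_dates"] is available to the
    # script that builds the template, without that script being edited.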

What you can use in your architecture is Cloud Functions together with Cloud Composer (a hosted solution for Airflow). Apache Airflow is designed to run DAGs on a regular schedule, but you can also trigger DAGs in response to events, such as a change in a Cloud Storage bucket (where the CSV files can be stored). You can wire this up with your frontend, and every time a new file arrives in the bucket, the DAG containing the step-by-step process is triggered.
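For the trigger itself you can use a background Cloud Function bound to the bucket's object-finalize event. A minimal sketch; trigger_dag is only a stub standing in for the authenticated call to the Airflow REST API that the Composer documentation describes:

    def trigger_dag(dag_id, conf):
        # Stub: in a real deployment this would make an authenticated request
        # to the Composer (Airflow) web server to start the given DAG.
        print(f"Would trigger {dag_id} with {conf}")

    def on_csv_upload(event, context):
        """Entry point of a Cloud Function subscribed to the
        google.storage.object.finalize event of the upload bucket."""
        bucket = event["bucket"]
        name = event["name"]
        if not name.lower().endswith(".csv"):
            return  # ignore anything that is not a CSV file
        trigger_dag("csv_cleaning_dag", conf={"bucket": bucket, "name": name})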

Please have a look at the official documentation, which describes the process of launching Dataflow pipelines with Cloud Composer using DataflowTemplateOperator.
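A minimal DAG along the lines of that documentation could look as follows (the contrib import path matches the Airflow 1.10 releases used by Cloud Composer; project, bucket and template paths are placeholders):

    from datetime import datetime, timedelta

    from airflow import models
    from airflow.contrib.operators.dataflow_operator import DataflowTemplateOperator

    default_args = {
        "start_date": datetime(2020, 1, 1),
        "retries": 1,
        "retry_delay": timedelta(minutes=5),
        "dataflow_default_options": {
            "project": "my-project",               # placeholder
            "tempLocation": "gs://my-bucket/tmp",  # placeholder
        },
    }

    with models.DAG(
        "csv_cleaning_dag",
        default_args=default_args,
        schedule_interval=None,  # triggered externally, e.g. by a Cloud Function
    ) as dag:

        clean_csv = DataflowTemplateOperator(
            task_id="run_csv_cleaning_template",
            # GCS path of the Dataflow template built from the user's selection.
            template="gs://my-bucket/templates/csv_cleaning",
            # Runtime parameters that the template reads, e.g. via ValueProvider.
            parameters={"input": "gs://my-bucket/input/data.csv"},
        )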
