
Best data pipeline framework

What is the best data pipeline framework that fits the following requirements:

  • Open source / free to use
  • Data pipelines need to be created using Python (should support GeoPandas, Pandas, NumPy, ...)
  • Supports both manually triggered and time-triggered pipelines
  • Web interface where non-technical users can start a pipeline (order data). It must be possible to use variables, which need to be defined at runtime.
  • Supports the ability to run pipelines in individual Docker containers
  • Integrates with source control (Git), i.e. downloads the newest data pipeline from Git

I have investigated Apache Airflow, but I would like to hear if there is a better alternative on the market that supports the requirements defined above :)

I would like to propose a framework that complies with nearly all of your requirements. Versatile Data Kit (VDK) is a DataOps framework that allows anyone with basic SQL or Python knowledge to create data pipelines.

I'll follow your points:

  • It is open source and free to use.
  • Data pipelines can be created using Python, SQL, or both.
  • It can be triggered manually via the CLI or scheduled via a cron-like line in the job's config file.
  • Recently, an Apache Airflow integration was released, which can be used as the interface for non-technical users to trigger a pipeline. In theory, runtime variables could also be set through Airflow, but the VDK Airflow provider does not support that at this point.
  • It runs on Kubernetes. Once deployed, each data job runs in its own Docker container.
  • It uses Git to deploy data jobs.
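
To make the Python point concrete, here is a minimal sketch of a VDK Python step. It assumes VDK's documented data-job layout, where each step file exposes a `run(job_input)` function that VDK calls in order; the `region` variable and the data are illustrative, not from VDK's docs:

```python
# A step file inside a VDK data job directory (e.g. 10_filter_orders.py).
# VDK discovers and calls run(job_input) in each step, in file-name order.
import pandas as pd

def run(job_input):
    # Runtime variables can be supplied when the job is executed, e.g.
    #   vdk run my-job --arguments '{"region": "EU"}'
    # and read back inside the step via job_input.get_arguments().
    args = job_input.get_arguments()
    region = args.get("region", "ALL")

    # Any Python library (Pandas, GeoPandas, NumPy, ...) can be used here.
    orders = pd.DataFrame({"region": ["EU", "US"], "orders": [10, 20]})
    if region != "ALL":
        orders = orders[orders["region"] == region]
    return orders  # returned only to make this sketch easy to test
```

This covers the "variables defined at runtime" requirement for CLI users; as noted above, passing those same arguments through the Airflow UI is not yet supported by the VDK Airflow provider.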

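The "cron-like line in the config file" mentioned above lives in the data job's config.ini. A sketch, assuming the section and key names (`[job]`, `schedule_cron`) from the VDK documentation:

```ini
; config.ini at the root of a VDK data job (key names assumed from VDK docs)
[owner]
team = my-team

[job]
; run every day at 06:00 UTC; omit this line for a manual-only job
schedule_cron = 0 6 * * *
```

For manual runs, the job is started locally with `vdk run <job-directory>`; `vdk deploy` pushes it to the server, where the schedule takes effect.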