
Right tool for near-real-time ETL architecture

We have a system where our primary data store (and "Universal Source of Truth") is Postgres, but we replicate that data both in real time and nightly in aggregate. We currently replicate to Elasticsearch, Redis, and Redshift (nightly only), and are adding Neo4j as well.

Our ETL pipeline has gotten expansive enough that we're starting to look at tools like Airflow and Luigi, but from what I can tell from my initial research, these tools are meant almost entirely for batch loads in aggregate.

Is there any tool that can handle both large batch ETL processes and on-the-fly, high-volume, individual-record replication? Do Airflow or Luigi handle this and I just missed it?

Thanks!

I'm no crazy expert on different ETL engines, but I've done a lot with Pentaho Kettle and am pretty happy with it performance-wise, especially if you tune your transformations to take advantage of the parallel processing.

I've mostly used it for handling real-time integrations and nightly ETL jobs that drive our reporting DB, but I'm pretty sure you could set it up to perform many real-time tasks.

I did once set up web services that called all sorts of things on our back end in real time, but that was very much not under any kind of load, and it sounds like you're doing heftier things than we are. Then again, it has functionality to cluster the ETL servers and scale out that I've never really played with.

It feels like Kettle could do these things if you spent the time to set it up right. Overall I love the tool. It's a joy to work in the GUI, TBH. If you're not familiar with it, or doubt the power of doing ETL from a GUI, you should check it out. You might be surprised.

As far as Luigi goes, you would likely end up with a micro-batch approach, running the jobs on a short interval. For example, you could trigger a cron job every minute to check for new records in Postgres tables and process that batch. You can create a task for each item, so that your processing flow itself is built around a single item (a minimal sketch follows below). At high volumes, say more than a few hundred updates per second, this becomes a real challenge.
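To make that concrete, here is a minimal sketch of the cron-driven micro-batch pattern, assuming a hypothetical `events` table with an auto-incrementing `id` column and a local checkpoint file; the table name, connection string, and replication step are all placeholders for illustration, not something from the original setup.

```python
# Driven from cron, e.g. every minute:
#   luigi --module microbatch ProcessNewRecords \
#       --batch-ts "$(date +%Y%m%dT%H%M)" --local-scheduler

import os
import luigi
import psycopg2

DSN = "dbname=appdb"          # assumed connection string
CHECKPOINT = "last_id.txt"    # last id fully replicated


def last_processed_id():
    return int(open(CHECKPOINT).read()) if os.path.exists(CHECKPOINT) else 0


class ProcessRecord(luigi.Task):
    """Processing flow built around a single record."""
    record_id = luigi.IntParameter()

    def output(self):
        # One marker file per record makes each task idempotent and retryable.
        return luigi.LocalTarget(f"done/{self.record_id}")

    def run(self):
        with psycopg2.connect(DSN) as conn, conn.cursor() as cur:
            cur.execute("SELECT payload FROM events WHERE id = %s",
                        (self.record_id,))
            row = cur.fetchone()
        # ... replicate `row` to Elasticsearch / Redis / Neo4j here ...
        with self.output().open("w") as f:
            f.write("ok")


class ProcessNewRecords(luigi.Task):
    """One micro-batch: fan out one ProcessRecord per new row, then advance
    the checkpoint only after all of them have succeeded."""
    batch_ts = luigi.Parameter()  # e.g. the cron minute; makes each run distinct

    def requires(self):
        # Cache the id list so requires() and run() agree on the same batch
        # (fine with --local-scheduler, where both run in one process).
        if not hasattr(self, "_ids"):
            with psycopg2.connect(DSN) as conn, conn.cursor() as cur:
                cur.execute("SELECT id FROM events WHERE id > %s ORDER BY id",
                            (last_processed_id(),))
                self._ids = [r[0] for r in cur.fetchall()]
        return [ProcessRecord(record_id=i) for i in self._ids]

    def output(self):
        return luigi.LocalTarget(f"batches/{self.batch_ts}")

    def run(self):
        # Runs only after every required ProcessRecord has completed, so the
        # checkpoint never advances past a record that failed to replicate.
        if self._ids:
            with open(CHECKPOINT, "w") as f:
                f.write(str(self._ids[-1]))
        with self.output().open("w") as f:
            f.write("done")
```

The per-record marker files are what make the "task per item" idea idempotent: rerunning a failed batch simply skips records that already completed, which matters once you're doing this every minute.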

Apache Spark has a scalable batch mode and a micro-batch mode of processing, plus some basic pipelining operators that can be adapted to ETL (sketched below). However, the complexity of the solution, in terms of supporting infrastructure, goes up quite a bit.
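For a sense of what the micro-batch mode looks like, here is a minimal Structured Streaming sketch. The JSON directory source, the schema, and the `index_record()` writer are all assumptions for illustration; in practice the stream would more likely be fed from something like a Kafka topic carrying your Postgres changes.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, LongType, StringType

spark = SparkSession.builder.appName("pg-replication").getOrCreate()

# Assumed shape of a change record; adjust to your actual tables.
schema = StructType([
    StructField("id", LongType()),
    StructField("payload", StringType()),
])

# Each new file dropped into /data/changes joins the next micro-batch.
changes = (
    spark.readStream
    .schema(schema)
    .json("/data/changes")
)


def index_record(record_id, payload):
    # Stub: replace with real writes to Elasticsearch / Redis / Neo4j.
    print(record_id, payload)


def replicate(batch_df, batch_id):
    # Called once per micro-batch; batch_df is an ordinary (batch) DataFrame.
    # collect() is fine for small batches; use foreachPartition at volume.
    for row in batch_df.collect():
        index_record(row.id, row.payload)


query = (
    changes.writeStream
    .foreachBatch(replicate)
    .option("checkpointLocation", "/data/checkpoints/pg-replication")
    .trigger(processingTime="10 seconds")  # micro-batch interval
    .start()
)
query.awaitTermination()
```

The appeal of `foreachBatch` here is that each micro-batch arrives as an ordinary DataFrame, so the same replication code can in principle serve both the streaming path and a nightly bulk load.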
