
Right tool for near-real-time ETL architecture

We have a system where our primary data store (and "Universal Source of Truth") is Postgres, but we replicate that data both in real time and nightly in aggregate. We currently replicate to Elasticsearch, Redis, and Redshift (nightly only), and are adding Neo4j as well.

Our ETL pipeline has gotten expansive enough that we're starting to look at tools like Airflow and Luigi, but from what I can tell from my initial research, these tools are meant almost entirely for batch loads in aggregate.

Is there any tool that can handle both large batch ETL processes and on-the-fly, high-volume, individual-record replication? Do Airflow or Luigi handle this and I just missed it?

Thanks!

I'm no crazy expert on different ETL engines, but I've done a lot with Pentaho Kettle and am pretty happy with it performance-wise, especially if you tune your transformations to take advantage of the parallel processing.

I've mostly used it for handling real-time integrations and nightly ETL jobs that drive our reporting DB, but I'm pretty sure you could set it up to perform many real-time tasks.

I did once set up web services that called all sorts of things on our back end in real time, but that was very much not under any kind of load, and it sounds like you're doing heftier things than we are. Then again, it has functionality to cluster the ETL servers and scale out that I've never really played with.

It feels like Kettle could do these things if you spent the time to set it up right. Overall I love the tool. It's a joy to work in the GUI, TBH. If you're not familiar with it, or doubt the power of doing ETL from a GUI, you should check it out. You might be surprised.

As far as Luigi goes, you would likely end up with a micro-batch approach, running the jobs on a short interval. For example, you could trigger a cron job every minute to check for new records in Postgres tables and process that batch. You can create a task for each item, so that your processing flow itself is built around a single item (a minimal sketch follows below). At high volumes, say more than a few hundred updates per second, this becomes a real challenge.
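To make that concrete, here is a minimal sketch of the cron-driven micro-batch pattern, assuming a hypothetical `events` table with an auto-incrementing `id` column and a local checkpoint file; the table name, connection string, and replication step are all placeholders for illustration, not something from the original setup.

```python
# Driven from cron, e.g. every minute:
#   luigi --module microbatch ProcessNewRecords \
#       --batch-ts "$(date +%Y%m%dT%H%M)" --local-scheduler

import os
import luigi
import psycopg2

DSN = "dbname=appdb"          # assumed connection string
CHECKPOINT = "last_id.txt"    # last id fully replicated


def last_processed_id():
    return int(open(CHECKPOINT).read()) if os.path.exists(CHECKPOINT) else 0


class ProcessRecord(luigi.Task):
    """Processing flow built around a single record."""
    record_id = luigi.IntParameter()

    def output(self):
        # One marker file per record makes each task idempotent and retryable.
        return luigi.LocalTarget(f"done/{self.record_id}")

    def run(self):
        with psycopg2.connect(DSN) as conn, conn.cursor() as cur:
            cur.execute("SELECT payload FROM events WHERE id = %s",
                        (self.record_id,))
            row = cur.fetchone()
        # ... replicate `row` to Elasticsearch / Redis / Neo4j here ...
        with self.output().open("w") as f:
            f.write("ok")


class ProcessNewRecords(luigi.Task):
    """One micro-batch: fan out one ProcessRecord per new row, then advance
    the checkpoint only after all of them have succeeded."""
    batch_ts = luigi.Parameter()  # e.g. the cron minute; makes each run distinct

    def requires(self):
        # Cache the id list so requires() and run() agree on the same batch
        # (fine with --local-scheduler, where both run in one process).
        if not hasattr(self, "_ids"):
            with psycopg2.connect(DSN) as conn, conn.cursor() as cur:
                cur.execute("SELECT id FROM events WHERE id > %s ORDER BY id",
                            (last_processed_id(),))
                self._ids = [r[0] for r in cur.fetchall()]
        return [ProcessRecord(record_id=i) for i in self._ids]

    def output(self):
        return luigi.LocalTarget(f"batches/{self.batch_ts}")

    def run(self):
        # Runs only after every required ProcessRecord has completed, so the
        # checkpoint never advances past a record that failed to replicate.
        if self._ids:
            with open(CHECKPOINT, "w") as f:
                f.write(str(self._ids[-1]))
        with self.output().open("w") as f:
            f.write("done")
```

The per-record marker files are what make the "task per item" idea idempotent: rerunning a failed batch simply skips records that already completed, which matters once you're doing this every minute.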

Apache Spark has a scalable batch mode and a micro-batch mode of processing, plus some basic pipelining operators that can be adapted to ETL (sketched below). However, the complexity of the solution, in terms of supporting infrastructure, goes up quite a bit.
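For a sense of what the micro-batch mode looks like, here is a minimal Structured Streaming sketch. The JSON directory source, the schema, and the `index_record()` writer are all assumptions for illustration; in practice the stream would more likely be fed from something like a Kafka topic carrying your Postgres changes.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, LongType, StringType

spark = SparkSession.builder.appName("pg-replication").getOrCreate()

# Assumed shape of a change record; adjust to your actual tables.
schema = StructType([
    StructField("id", LongType()),
    StructField("payload", StringType()),
])

# Each new file dropped into /data/changes joins the next micro-batch.
changes = (
    spark.readStream
    .schema(schema)
    .json("/data/changes")
)


def index_record(record_id, payload):
    # Stub: replace with real writes to Elasticsearch / Redis / Neo4j.
    print(record_id, payload)


def replicate(batch_df, batch_id):
    # Called once per micro-batch; batch_df is an ordinary (batch) DataFrame.
    # collect() is fine for small batches; use foreachPartition at volume.
    for row in batch_df.collect():
        index_record(row.id, row.payload)


query = (
    changes.writeStream
    .foreachBatch(replicate)
    .option("checkpointLocation", "/data/checkpoints/pg-replication")
    .trigger(processingTime="10 seconds")  # micro-batch interval
    .start()
)
query.awaitTermination()
```

The appeal of `foreachBatch` here is that each micro-batch arrives as an ordinary DataFrame, so the same replication code can in principle serve both the streaming path and a nightly bulk load.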
