
ETL design: What Queue should I use instead of my SQL table and still be able to process in parallel?

I need your help redesigning my system. We have a very simple but very old ETL, and now that we handle a massive amount of data it has become extremely slow and inflexible.

The first process is the collector:

Collector process - always up

  1. The collector collects a message from the queue (RabbitMQ).
  2. It parses the message properties (JSON format) into a Java object (for example, if the JSON contains fields like 'id', 'name' and 'color', we create a Java object with an int field 'id' and String fields 'name' and 'color').
  3. After parsing, we write the object to a CSV file as a CSV row containing all the properties of the object.
  4. We send an ack and continue to the next message in the queue (a minimal sketch of this loop follows below).
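A minimal sketch of that loop, assuming the standard RabbitMQ Java client and Jackson for JSON; the queue name, message fields and file path here are illustrative:

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import com.rabbitmq.client.*;
import java.io.FileWriter;

public class Collector {
    // example message shape; the real fields depend on your JSON
    record Message(int id, String name, String color) {}

    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        Channel channel = new ConnectionFactory().newConnection().createChannel();
        FileWriter csv = new FileWriter("collector-output.csv", true);

        channel.basicConsume("input-queue", false, (consumerTag, delivery) -> {
            // 2. parse the JSON message into a Java object
            Message m = mapper.readValue(delivery.getBody(), Message.class);
            // 3. append the object's properties as one CSV row (no escaping shown)
            csv.write(m.id() + "," + m.name() + "," + m.color() + "\n");
            csv.flush();
            // 4. ack and move on to the next message
            channel.basicAck(delivery.getEnvelope().getDeliveryTag(), false);
        }, consumerTag -> {});
    }
}
```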

Processing workflow - runs once every hour

  1. A process named 'Loader' loads all the CSV files (the collector's output) into a DB table named 'Input' using SQL LOAD DATA INFILE; all new rows get a 'Not handled' status. The 'Input' table acts as a queue in this design.
  2. A process named 'Processor' reads all records with 'Not handled' status from the table, transforms them into Java objects, does some enrichment, and then inserts the records into another table named 'Output' with new fields. **Each iteration we process 1000 rows in parallel, using a JDBC batch update for the DB insert** (see the sketch after this list).
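Roughly, that batching step corresponds to this pattern (a simplified sketch; table and column names are illustrative, and marking the rows as handled is omitted):

```java
import java.sql.*;

public class Processor {
    static void processBatch(Connection db) throws SQLException {
        try (PreparedStatement select = db.prepareStatement(
                 "SELECT id, name, color FROM Input WHERE status = 'Not handled' LIMIT 1000");
             PreparedStatement insert = db.prepareStatement(
                 "INSERT INTO Output (id, name, color, enriched) VALUES (?, ?, ?, ?)");
             ResultSet rs = select.executeQuery()) {
            while (rs.next()) {
                insert.setInt(1, rs.getInt("id"));
                insert.setString(2, rs.getString("name"));
                insert.setString(3, rs.getString("color"));
                insert.setString(4, enrich(rs.getString("name")));
                insert.addBatch();
            }
            insert.executeBatch(); // one JDBC batch insert for up to 1000 rows
        }
    }

    static String enrich(String name) { return name.toUpperCase(); } // placeholder enrichment
}
```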

The major problem with this flow:

The messages are not flexible in the existing flow: if I want, for example, to add a new property to the JSON message (say, to also add 'city'), I have to add a 'city' column to the table as well (because of the CSV file load). The table contains a massive amount of data, and it's not feasible to add a column every time the message changes.

My conclusion

The table is not the right choice for this design.

I have to get rid of the CSV writing and remove the 'Input' table to make the system flexible. I thought of using a queue such as Kafka instead of the table, and maybe tools such as Kafka Streams for the enrichment. This would give me flexibility, and I wouldn't need to add a column to a table every time I want to add a field to the message. The huge problem is that I wouldn't be able to process in parallel the way I do today.

What can I use instead of the table that will still allow me to process the data in parallel?

Yes, using Kafka will improve this.

Ingestion

Your process that currently writes CSV files can instead publish to a Kafka topic. Depending on your requirements and scope, this could even replace RabbitMQ entirely.
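For example, the collector could publish each message straight to a topic instead of appending CSV rows (a sketch using the standard Kafka Java producer; the topic name, key choice and serializers are assumptions):

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;

public class Ingestion {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            // publish the raw JSON instead of writing a CSV row;
            // keying by id keeps messages for the same entity in one partition
            String json = "{\"id\":1,\"name\":\"foo\",\"color\":\"red\"}";
            producer.send(new ProducerRecord<>("input-topic", "1", json));
        }
    }
}
```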

Loader (optional)

Your other process, which loads the data in its initial format and writes it to a database table, can instead publish to another Kafka topic in the format you want. This step can be omitted if you can write directly in the format the processor wants.

Processor

The way you use the 'Not handled' status is a way of treating your data as a queue, but this is handled by design in Kafka, which uses a log (whereas a relational database is modeled as a set).

The processor subscribes to the messages written by the loader or by ingestion. It transforms them into Java objects and does some enrichment - but instead of inserting the result into a new table, it can publish the data to a new output topic.

Instead of doing the work in batches ("each iteration we process 1000 rows in parallel, using a JDBC batch update for the DB insert"), with Kafka and stream processing this is done as a continuous real-time stream, as data arrives. You keep the parallelism: a topic is split into partitions, and each consumer instance in a consumer group (or each stream thread) processes its own partitions concurrently.
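A minimal Kafka Streams sketch of this consume-enrich-publish flow (topic names and the enrichment are placeholders; it scales out by running more instances of the same application):

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import java.util.Properties;

public class EnrichmentStream {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "processor");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> input = builder.stream("input-topic");
        // each record is enriched as it arrives, instead of in hourly 1000-row batches
        input.mapValues(EnrichmentStream::enrich)
             .to("output-topic");

        new KafkaStreams(builder.build(), props).start();
    }

    static String enrich(String json) { return json; } // placeholder enrichment
}
```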

Schema evolvability

> If I want, for example, to add a new property to the JSON message (for example, to also add 'city'), I have to add a 'city' column to the table as well (because of the CSV INFILE load); the table contains a massive amount of data, and it's not possible to add a column every time the message changes.

You can solve this by using an Avro schema when publishing to the Kafka topic. Avro supports schema evolution: a new field can be added with a default value, so old records and existing consumers keep working without any table migration.
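For example, the message schema could gain the new 'city' field like this (a sketch of an Avro schema; in practice a schema registry would typically check that such changes stay compatible):

```json
{
  "type": "record",
  "name": "InputMessage",
  "fields": [
    {"name": "id",    "type": "int"},
    {"name": "name",  "type": "string"},
    {"name": "color", "type": "string"},
    {"name": "city",  "type": ["null", "string"], "default": null}
  ]
}
```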
