
Process real-time data using Kafka

I have a requirement to implement a solution for the use case below.

Currently, applications store data in a Postgres database, but the database is running into storage issues. The plan is to move the data from Postgres to Hadoop, keeping the data in Hadoop near real time. So we came up with the solution below:

  1. Write a Kafka producer application that listens to the Postgres tables, captures the changed data, and writes it to a Kafka topic.
  2. Write a Kafka sink application to read from the Kafka topic and write to Hive tables (Parquet, external tables, partitioned and non-partitioned). For non-partitioned tables, if we want to apply updates/deletes, we would have to rewrite the whole table in the Spark code, right? That would degrade performance for every record coming off the Kafka topic. We have already developed a Sqoop incremental job that runs every 5 minutes to do the same thing, but the client needs real-time data in Hadoop, which is why Kafka + Spark processing came into the discussion (a sketch of what such a Spark job might look like follows this list).
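
Here is a minimal sketch of what the step 2 Spark job could look like, assuming Spark 3.x with the spark-sql-kafka-0-10 package on the classpath; the broker address, topic name, and HDFS paths are placeholders, not from the original question. It covers only the append-only path, which is exactly the limitation raised above: plain Parquet files can only be appended to, so applying updates/deletes would force rewriting the affected files.

```python
# Sketch only: stream CDC records from a Kafka topic into Parquet on
# HDFS with Spark Structured Streaming. Assumes Spark 3.x and the
# spark-sql-kafka-0-10 package; all hosts/paths are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-to-parquet").getOrCreate()

# Read the change-data stream from the Kafka topic.
stream = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "appserver.public.orders")
    .option("startingOffsets", "latest")
    .load())

# Kafka delivers key/value as binary; cast the value to a string so a
# downstream step can parse the JSON change event.
events = stream.select(col("value").cast("string").alias("json"))

# Append-only write to Parquet. Plain Parquet supports appends only;
# updates/deletes mean rewriting files, which is the performance
# concern for non-partitioned tables described above.
query = (events.writeStream
    .format("parquet")
    .option("path", "hdfs:///data/orders")
    .option("checkpointLocation", "hdfs:///checkpoints/orders")
    .outputMode("append")
    .start())

query.awaitTermination()
```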

Could you provide the pros and cons of step 2 compared to the Sqoop incremental job?

Please share any code snippets/links that would help my thought process.

Getting data into Kafka is easy: use Debezium.
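
A Debezium Postgres source connector is just a JSON config registered with the Kafka Connect REST API. Here's a hedged sketch; the hosts, credentials, and table list are placeholders, and the key names follow Debezium 2.x (older releases use database.server.name and table.whitelist instead):

```python
# Sketch: register a Debezium Postgres source connector through the
# Kafka Connect REST API. All hosts, credentials, and table names are
# placeholders for your environment.
import json
import requests

connector = {
    "name": "postgres-cdc",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "postgres-host",
        "database.port": "5432",
        "database.user": "cdc_user",
        "database.password": "cdc_password",
        "database.dbname": "appdb",
        # Logical name prefixed onto every emitted topic,
        # e.g. appserver.public.orders (Debezium 2.x key name).
        "topic.prefix": "appserver",
        "table.include.list": "public.orders",
    },
}

resp = requests.post(
    "http://connect-host:8083/connectors",
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector),
)
resp.raise_for_status()
```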


For getting it out...

I wouldn't use Hive at all for this. Real-time data (depending on the volume, obviously) results in tiny files in HDFS, so Hive queries become slower and slower over time.

Hive is not a replacement for Postgres. In fact, the Hive metastore still requires a relational database, such as Postgres.

I also wouldn't use Spark. You would have to write and maintain code, even though ingesting Kafka topics into queryable formats is already a solved problem with other tools.

Popular options include Apache Pinot, Apache Druid, or Apache Iceberg storage queried with Presto (some of which may overlap with HDFS storage, but will be much, much faster than Hive to query). Only the third option requires writing Kafka consumer code; the other two have native Kafka ingestion.
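
To illustrate "native Kafka ingestion": with Druid, for example, you submit a Kafka supervisor spec to the Overlord and it consumes the topic on its own, with no consumer code to write. This is a rough sketch only; the endpoint and spec fields are assumptions based on the Druid docs and vary by version, and the datasource, topic, and broker addresses are placeholders:

```python
# Rough sketch of Druid's native Kafka ingestion: submit a Kafka
# supervisor spec to the Overlord. Field names and the endpoint vary
# across Druid versions; check the docs for yours.
import json
import requests

supervisor_spec = {
    "type": "kafka",
    "spec": {
        "dataSchema": {
            "dataSource": "orders",
            "timestampSpec": {"column": "ts", "format": "iso"},
            # Empty dimensions list enables schemaless dimension
            # discovery in classic specs.
            "dimensionsSpec": {"dimensions": []},
            "granularitySpec": {"segmentGranularity": "hour"},
        },
        "ioConfig": {
            "topic": "appserver.public.orders",
            "inputFormat": {"type": "json"},
            "consumerProperties": {"bootstrap.servers": "broker:9092"},
        },
        "tuningConfig": {"type": "kafka"},
    },
}

resp = requests.post(
    "http://overlord-host:8090/druid/indexer/v1/supervisor",
    headers={"Content-Type": "application/json"},
    data=json.dumps(supervisor_spec),
)
resp.raise_for_status()
```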

And even then, if you're stuck with HDFS, the Kafka Connect framework ships with Kafka. There's an HDFS sink plugin, written by Confluent, that supports Hive integration.
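
A hedged sketch of that path: Confluent's HDFS sink connector writing Parquet with Hive integration turned on, registered the same way through the Connect REST API. The URIs and topic names are placeholders, and the exact option names should be checked against the connector docs for your version:

```python
# Sketch: register Confluent's HDFS Sink connector with Hive
# integration enabled. All hosts, URIs, and topics are placeholders.
import json
import requests

connector = {
    "name": "hdfs-sink",
    "config": {
        "connector.class": "io.confluent.connect.hdfs.HdfsSinkConnector",
        "topics": "appserver.public.orders",
        "hdfs.url": "hdfs://namenode:8020",
        # Write Parquet files and create/update the matching Hive tables.
        "format.class": "io.confluent.connect.hdfs.parquet.ParquetFormat",
        "hive.integration": "true",
        "hive.metastore.uris": "thrift://metastore:9083",
        "hive.database": "default",
        # Records buffered per file before commit; tune this to avoid
        # the tiny-files problem mentioned above.
        "flush.size": "10000",
    },
}

resp = requests.post(
    "http://connect-host:8083/connectors",
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector),
)
resp.raise_for_status()
```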
