
Process real-time data using Kafka

I have a requirement to implement a solution for the use case below.

Currently, applications store data in a Postgres database, but the database is running into storage issues. The plan is to move the data from Postgres to Hadoop, keeping the data in Hadoop near real time. So we came up with the solution below:

  1. Write a Kafka producer application that listens to the Postgres tables, captures the changed data, and writes it to a Kafka topic.
  2. Write a Kafka sink application to read from the Kafka topic and write to Hive tables (Parquet, external tables, partitioned and non-partitioned). For non-partitioned tables, if we want to apply updates/deletes, we would have to rewrite the whole table in the Spark code, right? That would degrade performance for every record coming off the Kafka topic. We have already developed a Sqoop incremental job that runs every 5 minutes to do the same thing, but the client needs real-time data in Hadoop, which is why Kafka + Spark processing came into the discussion (a sketch of what such a Spark job might look like follows this list).
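
Here is a minimal sketch of what the step 2 Spark job could look like, assuming Spark 3.x with the spark-sql-kafka-0-10 package on the classpath; the broker address, topic name, and HDFS paths are placeholders, not from the original question. It covers only the append-only path, which is exactly the limitation raised above: plain Parquet files can only be appended to, so applying updates/deletes would force rewriting the affected files.

```python
# Sketch only: stream CDC records from a Kafka topic into Parquet on
# HDFS with Spark Structured Streaming. Assumes Spark 3.x and the
# spark-sql-kafka-0-10 package; all hosts/paths are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-to-parquet").getOrCreate()

# Read the change-data stream from the Kafka topic.
stream = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "appserver.public.orders")
    .option("startingOffsets", "latest")
    .load())

# Kafka delivers key/value as binary; cast the value to a string so a
# downstream step can parse the JSON change event.
events = stream.select(col("value").cast("string").alias("json"))

# Append-only write to Parquet. Plain Parquet supports appends only;
# updates/deletes mean rewriting files, which is the performance
# concern for non-partitioned tables described above.
query = (events.writeStream
    .format("parquet")
    .option("path", "hdfs:///data/orders")
    .option("checkpointLocation", "hdfs:///checkpoints/orders")
    .outputMode("append")
    .start())

query.awaitTermination()
```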

Could you provide the pros and cons of step 2 compared to the Sqoop incremental job?

Please share any code snippets/links that would help my thought process.

Getting data into Kafka is easy: use Debezium.
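
A Debezium Postgres source connector is just a JSON config registered with the Kafka Connect REST API. Here's a hedged sketch; the hosts, credentials, and table list are placeholders, and the key names follow Debezium 2.x (older releases use database.server.name and table.whitelist instead):

```python
# Sketch: register a Debezium Postgres source connector through the
# Kafka Connect REST API. All hosts, credentials, and table names are
# placeholders for your environment.
import json
import requests

connector = {
    "name": "postgres-cdc",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "postgres-host",
        "database.port": "5432",
        "database.user": "cdc_user",
        "database.password": "cdc_password",
        "database.dbname": "appdb",
        # Logical name prefixed onto every emitted topic,
        # e.g. appserver.public.orders (Debezium 2.x key name).
        "topic.prefix": "appserver",
        "table.include.list": "public.orders",
    },
}

resp = requests.post(
    "http://connect-host:8083/connectors",
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector),
)
resp.raise_for_status()
```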


For getting it out...

I wouldn't use Hive at all for this. Real-time data (depending on the volume, obviously) results in tiny files in HDFS, so Hive queries become slower and slower over time.

Hive is not a replacement for Postgres. In fact, the Hive metastore still requires a relational database, such as Postgres.

I also wouldn't use Spark. You would have to write and maintain code, even though ingesting Kafka topics into queryable formats is already a solved problem with other tools.

Popular options include Apache Pinot, Apache Druid, or Apache Iceberg storage queried with Presto (some of which may overlap with HDFS storage, but will be much, much faster than Hive to query). Only the third option requires writing Kafka consumer code; the other two have native Kafka ingestion.
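
To illustrate "native Kafka ingestion": with Druid, for example, you submit a Kafka supervisor spec to the Overlord and it consumes the topic on its own, with no consumer code to write. This is a rough sketch only; the endpoint and spec fields are assumptions based on the Druid docs and vary by version, and the datasource, topic, and broker addresses are placeholders:

```python
# Rough sketch of Druid's native Kafka ingestion: submit a Kafka
# supervisor spec to the Overlord. Field names and the endpoint vary
# across Druid versions; check the docs for yours.
import json
import requests

supervisor_spec = {
    "type": "kafka",
    "spec": {
        "dataSchema": {
            "dataSource": "orders",
            "timestampSpec": {"column": "ts", "format": "iso"},
            # Empty dimensions list enables schemaless dimension
            # discovery in classic specs.
            "dimensionsSpec": {"dimensions": []},
            "granularitySpec": {"segmentGranularity": "hour"},
        },
        "ioConfig": {
            "topic": "appserver.public.orders",
            "inputFormat": {"type": "json"},
            "consumerProperties": {"bootstrap.servers": "broker:9092"},
        },
        "tuningConfig": {"type": "kafka"},
    },
}

resp = requests.post(
    "http://overlord-host:8090/druid/indexer/v1/supervisor",
    headers={"Content-Type": "application/json"},
    data=json.dumps(supervisor_spec),
)
resp.raise_for_status()
```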

And even then, if you're stuck with HDFS, the Kafka Connect framework ships with Kafka. There's an HDFS sink plugin, written by Confluent, that supports Hive integration.
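
A hedged sketch of that path: Confluent's HDFS sink connector writing Parquet with Hive integration turned on, registered the same way through the Connect REST API. The URIs and topic names are placeholders, and the exact option names should be checked against the connector docs for your version:

```python
# Sketch: register Confluent's HDFS Sink connector with Hive
# integration enabled. All hosts, URIs, and topics are placeholders.
import json
import requests

connector = {
    "name": "hdfs-sink",
    "config": {
        "connector.class": "io.confluent.connect.hdfs.HdfsSinkConnector",
        "topics": "appserver.public.orders",
        "hdfs.url": "hdfs://namenode:8020",
        # Write Parquet files and create/update the matching Hive tables.
        "format.class": "io.confluent.connect.hdfs.parquet.ParquetFormat",
        "hive.integration": "true",
        "hive.metastore.uris": "thrift://metastore:9083",
        "hive.database": "default",
        # Records buffered per file before commit; tune this to avoid
        # the tiny-files problem mentioned above.
        "flush.size": "10000",
    },
}

resp = requests.post(
    "http://connect-host:8083/connectors",
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector),
)
resp.raise_for_status()
```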
