I am trying to consume data from a Kafka topic, load it into a Dataset, and then apply a filter before writing it to HDFS.
I am able to consume from the Kafka topic, load the data into a Dataset, and save it as a parquet file in HDFS, but I am not able to apply the filter condition. Can you please share how to perform the filter before saving to HDFS? I am using Java with Spark to consume from the Kafka topic. Part of my code looks like this:
DataframeDeserializer dataframe = new DataframeDeserializer(dataset);
ds = dataframe.fromConfluentAvro("value", <your schema path>, <yourmap>, RETAIN_SELECTED_COLUMN_ONLY$.MODULE$);
StreamingQuery query = ds.coalesce(10)
        .writeStream()
        .format("parquet")
        .option("path", path.toString())
        .option("checkpointLocation", "<your path>")
        .trigger(Trigger.Once())
        .start();
Write the filter logic before coalesce, i.e. ds.filter(...).coalesce(...):
DataframeDeserializer dataframe = new DataframeDeserializer(dataset);
ds = dataframe.fromConfluentAvro("value", <your schema path>, <yourmap>, RETAIN_SELECTED_COLUMN_ONLY$.MODULE$);
StreamingQuery query = ds
        .filter(...) // Write your filter condition here
        .coalesce(10)
        .writeStream()
        .format("parquet")
        .option("path", path.toString())
        .option("checkpointLocation", "<your path>")
        .trigger(Trigger.Once())
        .start();
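For concreteness: in Spark's Java API, filter accepts either a SQL expression string, e.g. ds.filter("status = 'ACTIVE'"), or a typed FilterFunction. The column name status and the value ACTIVE here are assumptions for illustration, since your schema isn't shown. The predicate itself is plain Java; the self-contained sketch below exercises the same predicate logic with java.util.stream (a hypothetical Event record stands in for a Spark Row) so you can see what the condition keeps:

```java
import java.util.List;
import java.util.stream.Collectors;

public class FilterSketch {
    // Hypothetical record standing in for a Spark Row; field names are assumptions.
    record Event(String status, long value) {}

    // The same predicate you would hand to Dataset.filter(FilterFunction<Row>).
    static List<Event> keepActive(List<Event> in) {
        return in.stream()
                .filter(e -> "ACTIVE".equals(e.status()) && e.value() > 0)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Event> input = List.of(
                new Event("ACTIVE", 10),
                new Event("INACTIVE", 5),
                new Event("ACTIVE", -1));
        // Only the first event satisfies both conditions.
        System.out.println(keepActive(input).size());
    }
}
```

Because filter runs before writeStream, only matching rows reach the parquet sink.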
Instead of re-inventing the wheel, I would strongly recommend Kafka Connect. All you need is the HDFS Sink Connector, which replicates data from a Kafka topic to HDFS.
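As a sketch, a minimal properties file for the Confluent HDFS Sink Connector might look like the following; the name, topic, and namenode address are placeholders you would fill in for your cluster:

```
name=hdfs-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=1
topics=<your topic>
hdfs.url=hdfs://<namenode>:8020
flush.size=1000
# Write parquet files, matching what the Spark job produced
format.class=io.confluent.connect.hdfs.parquet.ParquetFormat
```

Note that Connect on its own just replicates the topic; if you still need row-level filtering, that would have to happen upstream or via a transform, so the Spark filter approach above may be simpler for your case.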