简体   繁体   中英

Reading Message from Kafka Topic and Dump it into HDFS

I am trying to consume data from Kafka Topic, load it into Dataset and then perform filter before load into into Hdfs.

I am able to consume from kafka topic, load it into dataset and save as a parquet file in HDFS But not able to perform filter condition. can you please share the way to perform filter before save to hdfs? i am using Java with Spark to consume from kafka topic. some part of my code is like this:

DataframeDeserializer dataframe = new DataframeDeserializer(dataset);

 ds = dataframe.fromConfluentAvro("value", <your schema path>, <yourmap>, RETAIN_SELECTED_COLUMN_ONLY$.MODULE$);

StreamingQuery query = ds.coalesce(10)
                .writeStream()
                .format("parquet")
                .option("path", path.toString())
                .option("checkpointLocation", "<your path>")
                .trigger(Trigger.Once())
                .start();

Write filter logic before coalesce ie ds.filter().coalesce()


DataframeDeserializer dataframe = new DataframeDeserializer(dataset);

 ds = dataframe.fromConfluentAvro("value", <your schema path>, <yourmap>, RETAIN_SELECTED_COLUMN_ONLY$.MODULE$);

StreamingQuery query = 
                ds
                .filter(...) // Write your filter condition here
                .coalesce(10)
                .writeStream()
                .format("parquet")
                .option("path", path.toString())
                .option("checkpointLocation", "<your path>")
                .trigger(Trigger.Once())
                .start();


Instead of re-inventing the wheel, I would strongly recommend Kafka Connect . All you need is the HDFS Sink Connector, that replicates the data from a Kafka topic to HDFS.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM