I am trying to consume data from a Kafka topic, load it into a Dataset, and then apply a filter before writing it to HDFS.
I am able to consume from the Kafka topic, load the data into a Dataset, and save it as a parquet file in HDFS, but I am not able to apply the filter condition. Can you please share how to perform the filter before saving to HDFS? I am using Java with Spark to consume from the Kafka topic. Part of my code looks like this:
DataframeDeserializer dataframe = new DataframeDeserializer(dataset);
ds = dataframe.fromConfluentAvro("value", <your schema path>, <yourmap>, RETAIN_SELECTED_COLUMN_ONLY$.MODULE$);
StreamingQuery query = ds.coalesce(10)
        .writeStream()
        .format("parquet")
        .option("path", path.toString())
        .option("checkpointLocation", "<your path>")
        .trigger(Trigger.Once())
        .start();
Write the filter logic before coalesce, i.e. ds.filter(...).coalesce(...):
DataframeDeserializer dataframe = new DataframeDeserializer(dataset);
ds = dataframe.fromConfluentAvro("value", <your schema path>, <yourmap>, RETAIN_SELECTED_COLUMN_ONLY$.MODULE$);
StreamingQuery query = ds
        .filter(...) // Write your filter condition here
        .coalesce(10)
        .writeStream()
        .format("parquet")
        .option("path", path.toString())
        .option("checkpointLocation", "<your path>")
        .trigger(Trigger.Once())
        .start();
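For concreteness: in Spark's Java API, filter accepts either a SQL expression string, e.g. ds.filter("status = 'ACTIVE'"), or a typed FilterFunction. The column name status and the value ACTIVE here are assumptions for illustration, since your schema isn't shown. The predicate itself is plain Java; the self-contained sketch below exercises the same predicate logic with java.util.stream (a hypothetical Event record stands in for a Spark Row) so you can see what the condition keeps:

```java
import java.util.List;
import java.util.stream.Collectors;

public class FilterSketch {
    // Hypothetical record standing in for a Spark Row; field names are assumptions.
    record Event(String status, long value) {}

    // The same predicate you would hand to Dataset.filter(FilterFunction<Row>).
    static List<Event> keepActive(List<Event> in) {
        return in.stream()
                .filter(e -> "ACTIVE".equals(e.status()) && e.value() > 0)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Event> input = List.of(
                new Event("ACTIVE", 10),
                new Event("INACTIVE", 5),
                new Event("ACTIVE", -1));
        // Only the first event satisfies both conditions.
        System.out.println(keepActive(input).size());
    }
}
```

Because filter runs before writeStream, only matching rows reach the parquet sink.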
Instead of re-inventing the wheel, I would strongly recommend Kafka Connect. All you need is the HDFS Sink Connector, which replicates data from a Kafka topic to HDFS.
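As a sketch, a minimal properties file for the Confluent HDFS Sink Connector might look like the following; the name, topic, and namenode address are placeholders you would fill in for your cluster:

```
name=hdfs-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=1
topics=<your topic>
hdfs.url=hdfs://<namenode>:8020
flush.size=1000
# Write parquet files, matching what the Spark job produced
format.class=io.confluent.connect.hdfs.parquet.ParquetFormat
```

Note that Connect on its own just replicates the topic; if you still need row-level filtering, that would have to happen upstream or via a transform, so the Spark filter approach above may be simpler for your case.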