
Reading Messages from a Kafka Topic and Dumping Them into HDFS


I am trying to consume data from a Kafka topic, load it into a Dataset, and then apply a filter before writing to HDFS.

I am able to consume from the Kafka topic, load the data into a Dataset, and save it as a Parquet file in HDFS, but I am not able to apply the filter condition. Can you please share how to apply a filter before saving to HDFS? I am using Java with Spark to consume from the Kafka topic. Part of my code looks like this:

DataframeDeserializer dataframe = new DataframeDeserializer(dataset);

ds = dataframe.fromConfluentAvro("value", <your schema path>, <your map>, RETAIN_SELECTED_COLUMN_ONLY$.MODULE$);

StreamingQuery query = ds.coalesce(10)
                .writeStream()
                .format("parquet")
                .option("path", path.toString())
                .option("checkpointLocation", "<your path>")
                .trigger(Trigger.Once())
                .start();

Write the filter logic before coalesce, i.e. ds.filter().coalesce():


DataframeDeserializer dataframe = new DataframeDeserializer(dataset);

ds = dataframe.fromConfluentAvro("value", <your schema path>, <your map>, RETAIN_SELECTED_COLUMN_ONLY$.MODULE$);

StreamingQuery query = 
                ds
                .filter(...) // Write your filter condition here
                .coalesce(10)
                .writeStream()
                .format("parquet")
                .option("path", path.toString())
                .option("checkpointLocation", "<your path>")
                .trigger(Trigger.Once())
                .start();


Instead of re-inventing the wheel, I would strongly recommend Kafka Connect. All you need is the HDFS Sink Connector, which replicates the data from a Kafka topic to HDFS.
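As a rough sketch, a Confluent HDFS Sink Connector can be registered with a configuration along these lines (the connector name, topic name, and HDFS URL below are placeholders, and the exact property set depends on your connector version):

```json
{
  "name": "hdfs-sink",
  "config": {
    "connector.class": "io.confluent.connect.hdfs.HdfsSinkConnector",
    "tasks.max": "1",
    "topics": "my-topic",
    "hdfs.url": "hdfs://namenode:8020",
    "flush.size": "1000",
    "format.class": "io.confluent.connect.hdfs.parquet.ParquetFormat"
  }
}
```

Note that Kafka Connect copies records as-is; if you still need the filter step, record-level filtering in Connect is typically done with a Filter single message transform plus a predicate, rather than in the sink configuration itself.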
