
Kafka -> Flink DataStream -> MongoDB


I want to set up Flink so that it transforms and redirects the data streams from Apache Kafka to MongoDB. For testing purposes I'm building on top of the flink-streaming-connectors.kafka example ( https://github.com/apache/flink ).

The Kafka streams are being properly read by Flink and I can map them etc., but the problem occurs when I want to save each received and transformed message to MongoDB. The only example of MongoDB integration I've found is flink-mongodb-test on GitHub. Unfortunately, it uses a static data source (a database), not a data stream.

I believe there should be some DataStream.addSink implementation for MongoDB, but apparently there isn't.

What would be the best way to achieve this? Do I need to write a custom sink function, or am I missing something? Maybe it should be done in a different way?

I'm not tied to any particular solution, so any suggestion would be appreciated.

Below is an example of exactly what I'm getting as input and what I need to store as output.

Apache Kafka Broker <-------------- "AAABBBCCCDDD" (String)
Apache Kafka Broker --------------> Flink: DataStream<String>

Flink: DataStream.map({
    return ("AAABBBCCCDDD").convertTo("A: AAA; B: BBB; C: CCC; D: DDD")
})
.rebalance()
.addSink(MongoDBSinkFunction); // store the row in MongoDB collection

As you can see in this example, I'm using Flink mostly for buffering Kafka's message stream and some basic parsing.
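For context, here is a minimal sketch of the source/transform part of the job in Java. The topic name, bootstrap servers and the fixed 3-character field layout are placeholders based on the example above, and the exact Kafka consumer class and SimpleStringSchema package differ between Flink/connector versions:

import java.util.Properties;

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class KafkaToMongoJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");
        props.setProperty("group.id", "flink-mongo-test");

        // Read the raw "AAABBBCCCDDD" strings from Kafka.
        DataStream<String> raw = env.addSource(
                new FlinkKafkaConsumer<>("input-topic", new SimpleStringSchema(), props));

        // Basic parsing: "AAABBBCCCDDD" -> "A: AAA; B: BBB; C: CCC; D: DDD".
        DataStream<String> parsed = raw.map(new MapFunction<String, String>() {
            @Override
            public String map(String value) {
                return "A: " + value.substring(0, 3)
                        + "; B: " + value.substring(3, 6)
                        + "; C: " + value.substring(6, 9)
                        + "; D: " + value.substring(9, 12);
            }
        });

        // print() is a stand-in for the MongoDB sink I'm missing.
        parsed.rebalance().print();

        env.execute("kafka-to-mongo-test");
    }
}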

As an alternative to Robert Metzger's answer, you can write your results back to Kafka and then use one of the maintained Kafka connectors to drop the content of a topic into your MongoDB database.

Kafka -> Flink -> Kafka -> Mongo/Anything

With this approach you can maintain the at-least-once semantics.
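A minimal sketch of the Flink-to-Kafka leg, assuming the Flink Kafka connector is on the classpath; the topic name and bootstrap servers are placeholders, and the producer class name (FlinkKafkaProducer here) has changed across Flink releases. A Kafka Connect MongoDB sink connector can then move the output topic into a collection:

import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;

public class KafkaOutput {
    // Attach a Kafka producer sink to the transformed stream.
    public static void writeBackToKafka(DataStream<String> transformed) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");

        transformed.addSink(
                new FlinkKafkaProducer<>("flink-output-topic", new SimpleStringSchema(), props));
    }
}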

There is currently no streaming MongoDB sink available in Flink.

However, there are two ways to write data into MongoDB:

  • Use Flink's DataStream.write() call. It allows you to use any OutputFormat (from the Batch API) with streaming. Using Flink's HadoopOutputFormatWrapper, you can use the official MongoDB Hadoop connector.

  • Implement the sink yourself. Implementing sinks is quite easy with the Streaming API, and I'm sure MongoDB has a good Java client library; a minimal sketch is shown after this list.
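Here is a minimal sketch of such a custom sink, assuming the MongoDB Java sync driver is on the classpath; the connection URI, database and collection names are placeholders:

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

import org.bson.Document;

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;

public class MongoDBSink extends RichSinkFunction<String> {

    private transient MongoClient client;
    private transient MongoCollection<Document> collection;

    @Override
    public void open(Configuration parameters) {
        // One connection per parallel sink instance.
        client = MongoClients.create("mongodb://localhost:27017");
        collection = client.getDatabase("testdb").getCollection("messages");
    }

    @Override
    public void invoke(String value) {
        // Store each transformed record as a separate document.
        collection.insertOne(new Document("payload", value));
    }

    @Override
    public void close() {
        if (client != null) {
            client.close();
        }
    }
}

You would then attach it to the stream with DataStream.addSink(new MongoDBSink()).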

Neither approach provides sophisticated processing guarantees. However, when you're using Flink with Kafka (and checkpointing enabled), you'll have at-least-once semantics: in an error case, the data is streamed again to the MongoDB sink. If you're doing idempotent updates, redoing these updates shouldn't cause any inconsistencies.
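For example, an idempotent write can be done by upserting on a stable key instead of inserting, so that replayed records overwrite the same document rather than creating duplicates; the id argument below is a hypothetical placeholder for whatever key the records carry:

import org.bson.Document;

import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.ReplaceOptions;

public class IdempotentWrite {
    // Replays after a failure overwrite the existing document for the same key.
    static void upsert(MongoCollection<Document> collection, String id, String payload) {
        collection.replaceOne(
                Filters.eq("_id", id),
                new Document("_id", id).append("payload", payload),
                new ReplaceOptions().upsert(true));
    }
}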

If you really need exactly-once semantics for MongoDB, you should probably file a JIRA issue with the Flink project and discuss with the community how to implement it.
