
Kafka -> Flink DataStream -> MongoDB

I want to set up Flink so it transforms and redirects data streams from Apache Kafka to MongoDB. For testing purposes I'm building on top of the flink-streaming-connectors.kafka example ( https://github.com/apache/flink ).

The Kafka streams are being properly read by Flink and I can map them etc., but the problem occurs when I want to save each received and transformed message to MongoDB. The only example of MongoDB integration I've found is flink-mongodb-test from GitHub. Unfortunately, it uses a static data source (a database), not a data stream.

I believe there should be some DataStream.addSink implementation for MongoDB, but apparently there isn't one.

What would be the best way to achieve this? Do I need to write a custom sink function, or am I missing something? Maybe it should be done in a different way?

I'm not tied to any solution, so any suggestion would be appreciated.

Below is an example of exactly what I'm getting as input and what I need to store as output.

Apache Kafka Broker <-------------- "AAABBBCCCDDD" (String)
Apache Kafka Broker --------------> Flink: DataStream<String>

Flink: DataStream.map({
    return ("AAABBBCCCDDD").convertTo("A: AAA; B: BBB; C: CCC; D: DDD")
})
.rebalance()
.addSink(MongoDBSinkFunction); // store the row in MongoDB collection

As you can see in this example, I'm using Flink mostly for buffering Kafka's message stream and for some basic parsing.

As an alternative to Robert Metzger's answer, you can write your results back to Kafka and then use one of the maintained Kafka connectors to drop the contents of that topic into your MongoDB database.

Kafka -> Flink -> Kafka -> Mongo/Anything

With this approach you can still maintain at-least-once semantics.
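For illustration, here is a minimal sketch of the Kafka -> Flink -> Kafka leg of that pipeline, assuming the flink-connector-kafka dependency is available; the broker address, topic names, consumer group and the parsing step are placeholders, and a Kafka Connect MongoDB sink connector (not shown) would then copy the output topic into the database.

import java.util.Properties;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;

public class KafkaToKafkaJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(5000); // checkpointing is needed for the at-least-once guarantee

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.setProperty("group.id", "flink-transformer");       // placeholder group

        // Read the raw messages from the input topic.
        DataStream<String> input = env.addSource(
                new FlinkKafkaConsumer<>("input-topic", new SimpleStringSchema(), props));

        // Transform the messages and write them back to a second topic;
        // a Kafka Connect MongoDB sink then copies that topic into the database.
        input.map(value -> "A: " + value.substring(0, 3))   // placeholder parsing
             .addSink(new FlinkKafkaProducer<>("output-topic", new SimpleStringSchema(), props));

        env.execute("Kafka -> Flink -> Kafka");
    }
}

Writing back to Kafka keeps the MongoDB-specific logic out of the Flink job entirely.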

There is currently no Streaming MongoDB sink available in Flink.

However, there are two ways for writing data into MongoDB:

  • Use the DataStream.write() call of Flink. It allows you to use any OutputFormat (from the batch API) with streaming. Using the HadoopOutputFormatWrapper of Flink, you can use the official MongoDB Hadoop connector.

  • Implement the sink yourself. Implementing sinks is quite easy with the Streaming API, and I'm sure MongoDB has a good Java client library (see the sketch after this list).
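
To illustrate the second option, here is a minimal sketch of such a custom sink, assuming the MongoDB Java driver is on the classpath; the host, database and collection names are placeholders for illustration only.

import com.mongodb.MongoClient;
import com.mongodb.client.MongoCollection;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
import org.bson.Document;

public class MongoDBSink extends RichSinkFunction<String> {

    private transient MongoClient client;
    private transient MongoCollection<Document> collection;

    @Override
    public void open(Configuration parameters) {
        // One connection per parallel sink instance; host/db/collection names are placeholders.
        client = new MongoClient("localhost", 27017);
        collection = client.getDatabase("mydb").getCollection("messages");
    }

    @Override
    public void invoke(String value) {
        // Store each transformed message as a single document.
        collection.insertOne(new Document("message", value));
    }

    @Override
    public void close() {
        if (client != null) {
            client.close();
        }
    }
}

You would then attach it with stream.addSink(new MongoDBSink()).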

Neither approach provides any sophisticated processing guarantees. However, when you're using Flink with Kafka (and checkpointing enabled) you have at-least-once semantics: in case of an error, the data is streamed again to the MongoDB sink. If you're doing idempotent updates, redoing these updates shouldn't cause any inconsistencies (see the sketch below).
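As an illustration of such an idempotent update, the write could be an upsert keyed on some message id, so that a replayed record overwrites the same document rather than inserting a duplicate. This is only a sketch: the messageId field is hypothetical, and ReplaceOptions requires a reasonably recent MongoDB Java driver.

import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.ReplaceOptions;
import org.bson.Document;

public class IdempotentWrite {

    // Replaying the same record after a failure overwrites the existing document
    // instead of inserting a duplicate, because the write is keyed on a
    // (hypothetical) messageId derived from the record.
    static void upsert(MongoCollection<Document> collection, String messageId, String value) {
        collection.replaceOne(
                Filters.eq("_id", messageId),
                new Document("_id", messageId).append("message", value),
                new ReplaceOptions().upsert(true));
    }
}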

If you really need exactly-once semantics for MongoDB, you should probably file a JIRA with the Flink project and discuss with the community how to implement it.
