
MongoDB oplog synchronization

I am evaluating MongoDB with Apache Storm. My use case is that I have to read data from MongoDB in Apache Storm, do some processing in the bolt, and dump it into a Neo4j graph database.

I am using a Mongo spout that reads data from the oplog. I went through the documentation, which says the primary node writes operations to the oplog and the secondaries apply them asynchronously. I know the oplog is a capped collection of a fixed size; data is written to it at very high velocity, while replication to the secondaries can be slower. When the oplog reaches its maximum size, it overwrites the oldest documents. If new writes keep arriving while replication is still incomplete, there is a possibility of losing data on the replica set, since a secondary that falls behind the oplog can no longer synchronize.
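The overwrite behavior described above can be modeled as a fixed-size ring buffer. This is only an illustrative sketch (the real oplog is capped by bytes, not by entry count), but it shows why a lagging secondary can lose its sync point:

```python
from collections import deque

# Toy model of a capped oplog: a fixed-size buffer whose oldest
# entries are silently dropped once capacity is reached.
# Illustrative only -- the real oplog is capped by size in bytes.
class CappedOplog:
    def __init__(self, max_entries):
        self.entries = deque(maxlen=max_entries)

    def append(self, op):
        # deque(maxlen=...) evicts the oldest entry automatically
        self.entries.append(op)

    def contains(self, ts):
        return any(e["ts"] == ts for e in self.entries)

oplog = CappedOplog(max_entries=3)
for ts in range(5):
    oplog.append({"ts": ts, "op": "i"})

# Operations 0 and 1 have been overwritten. A secondary that still
# needs them cannot catch up from the oplog and must fully resync.
print([e["ts"] for e in oplog.entries])  # [2, 3, 4]
print(oplog.contains(0))                 # False
```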

My questions are:

1) Is there any way to overcome this?

2) How can we better make use of this capped collection when using it with Apache Storm?

3) If I set a maximum oplog size of, say, 500 GB and the oplog holds only 1 GB of data, will it occupy and reserve the full 500 GB?

4) Is this the right solution for my use case?

Thanks in advance!!!

Yes, you can overcome this by increasing the size of the oplog. This requires a shutdown of the mongod instance to take effect.
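For reference, a hedged sketch of the two resize routes (version numbers and sizes are examples, not recommendations):

```shell
# Older MongoDB versions: set the size in mongod.conf (in MB) and
# restart mongod, as the answer describes:
#
#   replication:
#     oplogSizeMB: 16000

# MongoDB 3.6 and later can resize the oplog online, without a
# restart, via the mongo shell:
#
#   db.adminCommand({ replSetResizeOplog: 1, size: 16000 })
```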

I recently worked on a proof of concept similar to what you're doing, using a tailable cursor in Mongo to subscribe to any changes made in the oplog of a primary and migrate them to another database. We too ultimately looked into Storm to do this in a cleaner fashion. We were not 100% sold on Storm for this use case either, but the tailable cursor was a bit ugly and unreliable. I'd use Storm before a tailable cursor.

You can better make use of this capped collection with Storm by having Storm only pick up new commands. The replication issues you touch upon appear to be separate from the task of picking up new commands from the oplog on a primary and applying the operations of interest to Neo4j. If you were reading from the oplog on a secondary, I would better understand this to be an issue for your stated objective (i.e., writing the data to Neo4j). Since you are reading from the primary's oplog and can process the newest commands as they arrive, I am not sure there is an issue here.
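"Only pick up new commands" can be sketched as filtering oplog entries by timestamp and op type. The filter below is pure Python and testable; the commented pymongo loop after it shows roughly what tailing the real oplog looks like (`emit_to_storm` is a hypothetical spout hand-off, not a real API):

```python
def filter_new_ops(entries, last_ts, ops_of_interest=("i", "u", "d")):
    """Return oplog-style entries newer than last_ts whose op type we
    care about (i = insert, u = update, d = delete)."""
    return [e for e in entries
            if e["ts"] > last_ts and e["op"] in ops_of_interest]

sample = [
    {"ts": 1, "op": "i", "o": {"_id": 1}},
    {"ts": 2, "op": "n", "o": {}},          # "n" = no-op, skipped
    {"ts": 3, "op": "u", "o": {"x": 2}},
]
print(filter_new_ops(sample, last_ts=1))    # only the ts=3 update

# Tailing the real oplog with pymongo would look roughly like this
# (assumes a replica-set connection; sketch, not production code):
#
# from pymongo import MongoClient, CursorType
# client = MongoClient("mongodb://primary:27017", replicaset="rs0")
# oplog = client.local["oplog.rs"]
# cursor = oplog.find({"ts": {"$gt": last_seen_ts}},
#                     cursor_type=CursorType.TAILABLE_AWAIT)
# for doc in cursor:
#     emit_to_storm(doc)   # hypothetical hand-off to the spout
```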

Regarding the replica-set sync issues you raised: if your secondaries are so out of sync that you are losing replication, then there are issues that should be resolved well in advance. I do understand and appreciate your point, but a system designed to allow that to happen is in need of some TLC.

As you said, the oplog is a capped collection. When it runs out of space, it makes room for new commands by overwriting the oldest entries; nothing is "reserved," as you put it. Secondaries that have fallen too far behind cannot have those overwritten commands applied to them and require a full resync. What you need to be concerned with is the "replication oplog window," which denotes: 1) how long an operation remains in the oplog before being overwritten by a new entry, and 2) how long a secondary member can be offline and still catch up to the primary without doing a full resync.
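The oplog window is just the time span between the oldest and newest oplog entries. A minimal sketch with hypothetical timestamps (in a real deployment you would read the first and last documents of `local.oplog.rs`, or run `db.printReplicationInfo()` in the shell):

```python
from datetime import datetime, timedelta

def oplog_window(first_ts, last_ts):
    """Replication oplog window: how far back in time the oplog reaches.
    A secondary offline longer than this needs a full resync."""
    return last_ts - first_ts

# Hypothetical timestamps of the oldest and newest oplog entries.
first = datetime(2020, 1, 1, 0, 0, 0)
last = datetime(2020, 1, 1, 6, 30, 0)
print(oplog_window(first, last))  # 6:30:00 -- a 6.5-hour window
```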

