
mongodb update json serialized object with python

This is a problem about a frequently updated nested dictionary structure: a trade-off between CPU and IO.

There is a nested dict data structure in memory. Code 1:

domain={}
domain["www.xx.com"]={}
domain["www.xx.com"]["192.105.0.1"]={}
domain["www.xx.com"]["192.105.0.1"]["TTLS"]=set([20,80,3000])
domain["www.xx.com"]["192.105.0.1"]["FIRST_SEEN"]=1379484935.460281
domain["www.xx.com"]["192.105.0.1"]["LAST_SEEN"]=1379484945.46077

domain["www.xx.com"]["192.105.0.2"]={}
domain["www.xx.com"]["192.105.0.2"]["TTLS"]=set([70,90,2000])
domain["www.xx.com"]["192.105.0.2"]["FIRST_SEEN"]=13794674935.460281
domain["www.xx.com"]["192.105.0.2"]["LAST_SEEN"]=1379674945.46077

Then serialize the set parts, since json.dumps cannot handle sets. Code 2:

domain["www.xx.com"]["192.105.0.1"]["TTLS"]=list(domain["www.xx.com"]["192.105.0.1"]["TTLS"])
domain["www.xx.com"]["192.105.0.2"]["TTLS"]=list(domain["www.xx.com"]["192.105.0.2"]["TTLS"])
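Converting each TTLS entry by hand gets tedious as IPs accumulate. A small recursive helper (my own sketch, not part of the original code) can walk the whole dict and turn every set into a list before `json.dumps`:

```python
import json

def jsonable(obj):
    """Recursively replace sets with sorted lists so json.dumps accepts them."""
    if isinstance(obj, set):
        return sorted(obj)
    if isinstance(obj, dict):
        return {k: jsonable(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [jsonable(v) for v in obj]
    return obj

record = {"192.105.0.1": {"TTLS": {3000, 20, 80}, "LAST_SEEN": 1379484945.46077}}
print(json.dumps(jsonable(record)))
```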

Then dump this structure to mongodb, like Code 3:

db.myCollection.insert({"_id":"www.xx.com", "IPS":json.dumps(domain["www.xx.com"])})

The item is frequently updated. When a new day comes, the program generates a fresh dict entry for "www.xx.com" in memory, then updates that in-memory entry with the former info from mongodb. Yes, the reverse direction, for a simpler mongo update. Here, json.loads returns a dictionary just like what was dumped (except for the sets). Code 4:

mongo_dict=json.loads(db.myCollection.find_one({"_id":"www.xx.com"})["IPS"])
update_domain_with_mongo_dict(mongo_dict)
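update_domain_with_mongo_dict is not shown in the question; one plausible sketch (the merge rules here are my assumptions) restores the TTLS lists to sets and merges the stored history into today's fresh entry:

```python
def update_domain_with_mongo_dict(mongo_dict, today):
    """Merge yesterday's per-IP records (from json.loads) into today's dict.

    Assumed merge rules: union the TTL sets, keep the earliest FIRST_SEEN
    and the latest LAST_SEEN.
    """
    for ip, old in mongo_dict.items():
        cur = today.setdefault(ip, {"TTLS": set(),
                                    "FIRST_SEEN": old["FIRST_SEEN"],
                                    "LAST_SEEN": old["LAST_SEEN"]})
        cur["TTLS"] = set(cur["TTLS"]) | set(old["TTLS"])  # JSON list -> set again
        cur["FIRST_SEEN"] = min(cur["FIRST_SEEN"], old["FIRST_SEEN"])
        cur["LAST_SEEN"] = max(cur["LAST_SEEN"], old["LAST_SEEN"])
    return today
```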

So, at the end of the day, the program just dumps the whole domain["www.xx.com"] structure back to mongo. This avoids fine-grained document updates and keeps the IO simple, leaving the dirty work to the python program. (I have read many complaints about mongo's poor sub-document update abilities.) Code 5:

db.myCollection.update({"_id":"www.xx.com"},{"$set":{"IPS":json.dumps(domain["www.xx.com"])}})

However, many of these updates seem meaningless. Even when no update occurs, or only a slight one, the program still has to write the whole dict item back to mongodb. Seen this way, the IO is too large. So here is the problem: I need to update the sub-documents separately, with many for-loops and changed-or-not checks. So, json dumps/loads, goodbye.

Then the refined mongo object and code might look like this: Code 6

{
    "_id":"www.xx.com",
    "IPS":[
        {
            "IP":"192.168.0.1",
            "TTLS":[20, 80, 3000],
            "FIRST_SEEN":1379484935.460281,
            "LAST_SEEN":1379484945.46077
        },
        {
            "IP":"192.168.0.2",
            "TTLS":[70, 90, 2000],
            "FIRST_SEEN":13794674935.460281,
            "LAST_SEEN":1379674945.46077
        }
    ]
}
db.myCollection.update({"_id":"www.xx.com"}, {"$set":{"IPS.0.FIRST_SEEN":1379674945.46077}})

However, this kind of update needs the array index '0', which is determined by the key ip. In this structure I give up json dumps/loads, which means the dict is abandoned. To get the index, a for-loop is inevitable. This might save IO, but the CPU will cry.
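For scale, the feared for-loop is just a linear scan over one domain's IP array, which is cheap in absolute terms. A sketch (function name is mine) of finding the index and building the dotted update path:

```python
def find_ip_index(ips, ip):
    """Linear scan for the array index of a given IP; -1 if absent."""
    for i, entry in enumerate(ips):
        if entry["IP"] == ip:
            return i
    return -1

ips = [{"IP": "192.168.0.1"}, {"IP": "192.168.0.2"}]
i = find_ip_index(ips, "192.168.0.2")
if i >= 0:
    field = "IPS.%d.LAST_SEEN" % i  # dotted path for the index-based update
```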

So, guys, you have read this far: what's your choice? Any fantastic solution? Surprise me. Let me know if I missed anything. Thanks.

You shouldn't need to track all this in your app; Mongo is pretty robust.

When adding new things you want to use $push:

db.myCollection.update(
    {"_id":"www.xx.com"},
    {"$push":{"IPS": <new to be added>}}
)

That means you can add each item separately rather than inserting them all at once. But most importantly, this lets the array grow without you caring what is already in there. You can also add $each to push several items in one operation if you have to.
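Built as plain dicts (so no server is needed to follow along; the entry values are illustrative), the single-push, batched $each, and duplicate-avoiding $addToSet variants look like this:

```python
new_entry = {"IP": "192.105.0.3", "TTLS": [60],
             "FIRST_SEEN": 1379760000.0, "LAST_SEEN": 1379760000.0}

# push one element onto the IPS array
single = {"$push": {"IPS": new_entry}}

# push several elements in one operation with $each
batch = {"$push": {"IPS": {"$each": [new_entry]}}}

# $addToSet skips elements already present, giving set semantics
dedup = {"$addToSet": {"IPS": {"$each": [new_entry]}}}
```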

When you want to update things, rather than keep track of the index, find the element on a value in the array:

db.myCollection.update(
    {"_id":"www.xx.com", "IPS.IP": <IP Entry>},
    {"$set":{"IPS.$.LAST_SEEN": <new date value>}}
)

The positional operator $ makes sure the update occurs at the position of the matched "IP".
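Put together, one positional update can refresh LAST_SEEN and add any new TTLs in a single round trip. The documents are built as plain dicts here; the update_one call assumes pymongo 3+ and a running server, so it is shown commented out:

```python
filt = {"_id": "www.xx.com", "IPS.IP": "192.105.0.1"}
update = {
    "$set": {"IPS.$.LAST_SEEN": 1379674945.46077},
    "$addToSet": {"IPS.$.TTLS": {"$each": [20, 45]}},  # only TTLs not already present
}
# from pymongo import MongoClient
# db = MongoClient().mydb
# db.myCollection.update_one(filt, update)
```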

If you are really worried about latency, your use case might allow you to dial back the write concern. That reference explains what you may be able to get away with to speed write responses up a bit.

At any rate, just by implementing these methods you should be able to wind back your application-side tracking of this information a bit, rather than trying to track everything. That is, with some changes, if you even need to keep that part at all.

So by all means, keep a cache and don't write on every update. But when you do write, flush it out using these sorts of operations, so you are getting value out of your writes.
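A dirty-set is one simple way to implement such a cache: record each observation in memory, mark the IP dirty only when something actually changed, and flush just the dirty IPs with the updates above. All names here are my own sketch:

```python
def observe(domain_entry, dirty, ip, ttl, ts):
    """Record one observation in memory; mark the IP dirty only on a real change."""
    rec = domain_entry.setdefault(
        ip, {"TTLS": set(), "FIRST_SEEN": ts, "LAST_SEEN": ts})
    changed = False
    if ttl not in rec["TTLS"]:
        rec["TTLS"].add(ttl)
        changed = True
    if ts > rec["LAST_SEEN"]:
        rec["LAST_SEEN"] = ts
        changed = True
    if changed:
        dirty.add(ip)  # only these IPs get flushed to mongo later
```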

Take some time to look at the full operator reference while you are there.
