
Pull (all but one) document from array of subdocuments in mongodb

I have a set of data that I'd imported a while back. On every import I'd appended a 'history' subdocument to a history array. The overall structure is something like this but with more fields:

{ _id: ObjectId('000000000000000001'),
  history: [ {date: ISODate("2014-05-25T22:00:00Z"), value: 1},
             {date: ISODate("2014-05-26T22:00:00Z"), value: 1},
             {date: ISODate("2014-05-26T22:00:00Z"), value: 1} 
  ]
}

The problem is, in some cases the import was bad and I ended up with duplicated history entries for the same date. I would like to remove all of the duplicates. I attempted to do this using the $pull update operator, planning to call it repeatedly until each date had the right number of history entries. The problem is, I have over a million data points and they each have a different number of duplicates - some with as many as 12 dupes. Is there some way to pull all but one without using mapReduce? I'm thinking something like:

db.test.update({'history.date': new Date(2014,4,26)},
               {
                $pullAll : 
                   {'history': {date: new Date(2014,4,27)}},
                $push : {'history' : {}}
               }, {multi:true})

Try this; it de-duplicates each history array by date in a single pass over the collection:

db.collection.find().forEach(function(doc) {
    var seen = {};
    // keep only the first history entry seen for each date
    var deduped = doc.history.filter(function(h) {
        var key = +h.date;   // timestamp as the lookup key
        return seen[key] ? false : (seen[key] = true);
    });
    db.collection.update(
        { "_id": doc._id },
        { "$set": { "history": deduped } }
    );
})

The problem with what you propose is that you actually end up with conflicting paths in your statement, as both operations act on the "history" array. Those operations do not perform "sequentially" as you might think; the conflict produces an error when the server parses the update.
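
A minimal sketch of the conflict (using $pull here, since $pullAll expects an array of exact values to remove; the exact error text varies by server version):

db.test.update(
    { 'history.date': new Date(2014,4,26) },
    {
        // both operators target the "history" path, so the server
        // rejects the update instead of running them one after the other
        $pull: { 'history': { date: new Date(2014,4,27) } },
        $push: { 'history': {} }
    },
    { multi: true }
)
// error similar to: "Cannot update 'history' and 'history' at the same time"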

Also, you are essentially "wiping" the contents of the array, and even if your notation was really just shorthand rather than an intent to push an empty object {}, there is presently no way to update a document based on the existing values found in that document.

So the end approach is looping to do this, which is really not that bad:

db.collection.find().forEach(function(doc) {
    // first empty the array ...
    db.collection.update(
        { "_id": doc._id },
        { "$set": { "history": [] } }
    );
    // ... then re-add the entries; $addToSet skips any entry that is an
    // exact duplicate (every field equal) of one already added
    db.collection.update(
        { "_id": doc._id },
        { "$addToSet": { "history": { "$each": doc.history } } }
    );
})
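
Run against the sample document from the question, that loop would leave the following (a sketch of the expected result; note that $addToSet only drops entries that are identical in every field):

{ _id: ObjectId('000000000000000001'),
  history: [ {date: ISODate("2014-05-25T22:00:00Z"), value: 1},
             {date: ISODate("2014-05-26T22:00:00Z"), value: 1}
  ]
}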

Of course if you have MongoDB 2.6 or greater you can do this with Bulk operations, which makes things very efficient:

var count = 0;
var bulk = db.collection.initializeOrderedBulkOp();

db.collection.find().forEach(function(doc) {

    // queue the same pair of updates as above
    bulk.find({ "_id": doc._id }).update({
        "$set": { "history": [] }
    });
    bulk.find({ "_id": doc._id }).update({
        "$addToSet": { "history": { "$each": doc.history } }
    });
    count++;

    // send to the server every 500 documents (1,000 queued operations)
    if ( count % 500 == 0 ) {
        bulk.execute();
        bulk = db.collection.initializeOrderedBulkOp();
        count = 0;
    }

});

// flush any remaining queued operations
if ( count > 0 )
    bulk.execute();

So that pairs the operations up and sends a batch of 1,000 operations (two per document) for every 500 documents read, which should be safely under the BSON 16MB limit, and of course you can tune that as you want. Though each update is actually performed in series, the actual send/response round trip to the server only occurs once per 500 documents in this example.

You might also consider finding the documents that contain the duplicates using the aggregate method instead, in order to improve efficiency by not updating documents that do not need to be updated:

db.collection.aggregate([
    // carry the whole original document through the pipeline
    { "$project": {
        "_id": "$$ROOT",
        "history": 1
    }},
    { "$unwind": "$history" },
    // count identical (date, value) entries per document
    { "$group": {
        "_id": {
            "id": "$_id._id",
            "date": "$history.date",
            "value": "$history.value"
        },
        "orig": { "$first": "$_id" },
        "count": { "$sum": 1 }
    }},
    // keep only entries that occur more than once
    { "$match": { "count": { "$gt": 1 } } },
    // collapse back to one result per affected document
    { "$group": {
        "_id": "$orig._id",
        "history": { "$first": "$orig.history" }
    }}
]).forEach(function(doc) {
    // same as above
});
Or even use that as a springboard to remove the duplicates up front, so you only need to send one update per document, a single $set with the already de-duplicated array:

var count = 0;
var bulk = db.collection.initializeOrderedBulkOp();

db.collection.aggregate([
    { "$unwind": "$history" },
    // one group per unique (date, value) entry per document
    { "$group": {
        "_id": { "id": "$_id", "date": "$history.date", "value": "$history.value" }
    }},
    // rebuild each document's history from its unique entries
    // (note: $group does not preserve the original array order)
    { "$group": {
        "_id": "$_id.id",
        "history": { "$push": { "date": "$_id.date", "value": "$_id.value" } }
    }}
]).forEach(function(doc) {

    bulk.find({ "_id": doc._id }).update({
        "$set": { "history": doc.history }
    });
    count++;

    if ( count % 500 == 0 ) {
        bulk.execute();
        bulk = db.collection.initializeOrderedBulkOp();
        count = 0;
    }

});

if ( count > 0 )
    bulk.execute();

So there are a few approaches to getting rid of those duplicate entries that you can consider and adapt to your needs.

I was just about to implement one of the scripts mentioned above, when I got the idea that I could do this in three steps in the mongo shell:

date = new Date(2014,4,26);
temp = 'SOMESPECIALTEMPVALUE'

db.test.update({'history.date': date},
           {$set: {
               'history.$.date' : temp
           }}, {multi:true})

db.test.update({'history.date': temp},
           {$pull: {
               'history' : {date: date}
           }}, {multi:true})

db.test.update({'history.date': temp},
           {$set: {
               'history.$.date' : date
           }}, {multi:true})

This works because the positional $ operator only updates the first matching subdocument, so the first entry for each date is renamed to the temp value. Using $pull I then remove all the remaining duplicates, which still carry the original date. Lastly I reset the remaining temp value to its original date. This works well enough for me because it's a one-time operation with only three affected dates. Otherwise I'd probably go with the script approach.
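
As a sanity check afterwards (a sketch reusing the question's collection and field names), an aggregation can confirm that no date appears more than once in any document:

db.test.aggregate([
    { "$unwind": "$history" },
    { "$group": {
        "_id": { "id": "$_id", "date": "$history.date" },
        "n": { "$sum": 1 }
    }},
    // any remaining duplicates would show up here; expect no output
    { "$match": { "n": { "$gt": 1 } } }
])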
