
Delete all duplicates of record in collection - MongoDB

I have a MongoDB collection that looks like this (below). As you can see, it has a number of duplicate records, with maybe a few attributes that differ. The collection holds more than 18,000 documents, and I need to remove all of the duplicates. It doesn't matter which one I keep; I just need no dupes. Can anyone help or point me in the right direction?

{
  commonName: "Lionel Messi",
  firstName: "Lionel",
  lastName: "Messi",
  rating: 97
},{
  commonName: "Lionel Messi",
  firstName: "Lionel",
  lastName: "Messi",
  rating: 96
},{
  commonName: "Lionel Messi",
  firstName: "Lionel",
  lastName: "Messi",
  rating: 92
},{
  commonName: "Jamie Vardy",
  firstName: "Jamie",
  lastName: "Vardy",
  rating: 82
},{
  commonName: "Jamie Vardy",
  firstName: "Jamie",
  lastName: "Vardy",
  rating: 86
}

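Create a temporary collection with a unique index on the fields that define a duplicate. For the sample data that means the name fields: a unique index that also included rating would treat these records as distinct, because their ratings differ. Then copy the data from the original collection into the temporary collection, which will end up containing only unique records. After that, you can clear the original collection and move the records back from the temporary collection.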
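The effect of copying into a collection with a unique index can be sketched in plain JavaScript, without a database. This is only an illustration of the idea, not MongoDB code: the helper name copyUnique and the shortened sample array are assumptions made for the example. It also shows why the choice of key fields matters.

```javascript
// Hypothetical helper (not a MongoDB API): simulates copying documents into a
// collection that has a unique index on keyFields. Only the first document
// seen for each key is kept.
function copyUnique(docs, keyFields) {
  const seen = new Map();
  for (const doc of docs) {
    // Build a compound key from the chosen fields, similar to a compound index.
    const key = JSON.stringify(keyFields.map((f) => doc[f]));
    if (!seen.has(key)) {
      seen.set(key, doc); // later documents with the same key are skipped
    }
  }
  return [...seen.values()];
}

// Shortened version of the sample data from the question.
const players = [
  { commonName: "Lionel Messi", firstName: "Lionel", lastName: "Messi", rating: 97 },
  { commonName: "Lionel Messi", firstName: "Lionel", lastName: "Messi", rating: 96 },
  { commonName: "Jamie Vardy", firstName: "Jamie", lastName: "Vardy", rating: 82 },
  { commonName: "Jamie Vardy", firstName: "Jamie", lastName: "Vardy", rating: 86 },
];

// Keyed on the name fields only: one document per player survives.
const byName = copyUnique(players, ["commonName", "firstName", "lastName"]);
// Keyed on all four fields: nothing is removed, since every rating differs.
const byAllFields = copyUnique(players, ["commonName", "firstName", "lastName", "rating"]);

console.log(byName.length, byAllFields.length);
```

With the name fields as the key, the first Messi and the first Vardy document are kept. With rating included in the key, all four documents count as unique.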

You can use aggregate to clean your data, and then use $out to write a collection, or even overwrite your current collection:

db.players.aggregate([
  { 
    $group : {
      _id : { commonName: "$commonName"  },
      commonName: {$first: "$commonName"},
      firstName: {$first: "$firstName"},
      lastName: {$first: "$lastName"},
      rating: {$first: "$rating"},
    }
  },
  { $project : { _id:0, commonName:1, firstName:1, lastName:1, rating:1 } },
  { $out : "players" }
])

Note: To write the result to a new collection instead of overwriting the current one, use { $out: "newCollection" }

You could clean your data by adding a unique index. Depending on your MongoDB version, there are two ways to do it.

If your MongoDB version is 2.6 or older (the dropDups option was removed in 3.0), you can run this command:

db.players.ensureIndex({'commonName' : 1, 'firstName' :1 }, {unique : true, dropDups : true})

If your version is newer then you could do something like this:

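db.players.aggregate([
  // Group documents that share the same commonName and firstName
  { "$group": {
      "_id": { "commonName": "$commonName", "firstName": "$firstName" },
      "dups": { "$push": "$_id" },   // collect the _id of every document in the group
      "count": { "$sum": 1 }
  }},
  // Keep only the groups that actually contain duplicates
  { "$match": { "count": { "$gt": 1 } } }
]).forEach(function(doc) {
  doc.dups.shift();                                   // keep the first document of each group
  db.players.remove({ "_id": { "$in": doc.dups } });  // remove the rest from the same collection
});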

db.players.createIndex({"commonName":1 , "firstName": 1},
{unique:true})

Warning: You should first try this on some test data, just to be sure you are not removing important data that you want.
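One way to act on that warning is a dry run that only lists which _id values would be removed. The sketch below is plain JavaScript working on an in-memory array. The helper name findDuplicateIds and the numeric _id values are assumptions for illustration. In the mongo shell, the array could presumably come from something like db.players.find().toArray().

```javascript
// Hypothetical dry-run helper (not a MongoDB API): returns the _id values that
// a group-and-remove pass would delete, keeping the first document per group.
function findDuplicateIds(docs, keyFields) {
  const groups = new Map();
  for (const doc of docs) {
    const key = JSON.stringify(keyFields.map((f) => doc[f]));
    if (!groups.has(key)) {
      groups.set(key, []);
    }
    groups.get(key).push(doc._id); // same role as { "$push": "$_id" }
  }
  const toRemove = [];
  for (const ids of groups.values()) {
    toRemove.push(...ids.slice(1)); // skip the first _id, like dups.shift()
  }
  return toRemove;
}

// Test data modelled on the question; the _id values are made up.
const sample = [
  { _id: 1, commonName: "Lionel Messi", firstName: "Lionel", rating: 97 },
  { _id: 2, commonName: "Lionel Messi", firstName: "Lionel", rating: 96 },
  { _id: 3, commonName: "Lionel Messi", firstName: "Lionel", rating: 92 },
  { _id: 4, commonName: "Jamie Vardy", firstName: "Jamie", rating: 82 },
  { _id: 5, commonName: "Jamie Vardy", firstName: "Jamie", rating: 86 },
];

const idsToRemove = findDuplicateIds(sample, ["commonName", "firstName"]);
console.log(idsToRemove);
```

Reviewing this list before running the real removal makes it easier to confirm that exactly one document per player remains.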
