MongoDB updating all records in a collection with the results of query from another collection

I have around 40k records to update, and each record gets its data from a query against another collection.
I have an existing query that does this, but it runs for more than an hour. It usually disconnects, and then I have to rerun it.
I think there is a better way to do this. I am new to MongoDB; this solution works, but I am not happy with the execution speed.
Maybe you have a better or much faster solution.

To better illustrate the data, please see below:

accounts

[
  {
    "_id": ObjectId("AC101"),
    "emails":null,
    "name":"Account 101",
    ...
  },
  {
    "_id": ObjectId("AC102"),
    "emails":null,
    "name":"Account 102",
    ...
  },
  {
    "_id": ObjectId("AC103"),
    "emails":null,
    "name":"Account 103",
    ...
  },
  ...
]

account_contacts

[
  {
    "_id": ObjectId("ACC001"),
    "account": {
        "$ref" : "account",
        "$id" : ObjectId("AC101")
    },
    "email":"acc001@test.com",
    "name":"Contact 001",
    ...
  },
  {
    "_id": ObjectId("ACC002"),
    "account": {
        "$ref" : "account",
        "$id" : ObjectId("AC102")
    },
    "email":"acc002@test.com",
    "name":"Contact 002",
    ...
  },
  {
    "_id": ObjectId("ACC003"),
    "account": {
        "$ref" : "account",
        "$id" : ObjectId("AC103")
    },
    "email":"acc003@test.com",
    "name":"Contact 003",
    ...
  },
  {
    "_id": ObjectId("ACC004"),
    "account": {
        "$ref" : "account",
        "$id" : ObjectId("AC103")
    },
    "email":"acc004@test.com",
    "name":"Contact 004",
    ...
  },
  {
    "_id": ObjectId("ACC005"),
    "account": {
        "$ref" : "account",
        "$id" : ObjectId("AC103")
    },
    "email":"acc005@test.com",
    "name":"Contact 005",
    ...
  },
  ...
]

Query:

db.getCollection('accounts').find({ 'emails':{ $eq:null } }).forEach(p => {
    const emails = [];
    db.getCollection('account_contacts').find({"account.$id": p._id}).forEach(c => {
        emails.push(c.email);
    });
    db.getCollection('accounts').updateOne({"_id": p._id}, {$set: {"emails": emails}});
});

I filter for accounts with null emails so that, if the script hits a timeout error (after about 1 hour), I can just rerun it and it will process only the accounts that still have null emails.

Currently, I have no idea how to improve the query, but I know it is not the best solution for this case since it takes more than an hour.

Update:

While I still cannot make the aggregate/$lookup approach work, I did try running the old script directly in the mongo console. When I ran it before from my IDE, it took more than an hour; run directly in the mongo console, it takes only 12-14 minutes, which is not bad.

This is what I am doing for now, but I still want to convert my script to use aggregation.

TIA

Using MongoDB 4.2, you can avoid pulling the documents to the client side if you are willing to use a temporary collection.

Use aggregation to match all of the documents whose emails field is null, extract just the _id, and store it in a temporary collection. Note that an index on {emails:1, _id:1} will streamline this part. You may want to procedurally generate the temporary collection name so that successive runs do not reuse the same name.

db.accounts.aggregate([
    {$match: {emails: null}},
    {$project: {_id: 1}},
    {$out: "temporary_null_email_collection"}
])
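
One way to generate the temporary collection name per run is sketched below. This is only an illustration: tempCollectionName and the "temporary_null_email" prefix are hypothetical names, not part of the original answer, and the timestamp format is an arbitrary choice.

```javascript
// Hypothetical helper: builds a collection name that differs on each run
// by appending the digits of an ISO timestamp to a prefix.
function tempCollectionName(prefix, now = new Date()) {
    // "2024-01-02T03:04:05.678Z" becomes "20240102030405678"
    const stamp = now.toISOString().replace(/\D/g, "");
    return `${prefix}_${stamp}`;
}

const tmpName = tempCollectionName("temporary_null_email");
// In the mongo shell, the name would then replace the literal in the $out stage:
// db.accounts.aggregate([{$match: {emails: null}}, {$project: {_id: 1}}, {$out: tmpName}])
```

The same tmpName would have to be used for the follow-up aggregation, and the temporary collection can be dropped once the merge has finished.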

Then aggregate the temporary collection, look up the emails from the account_contacts collection, get rid of any extraneous fields, and merge the results back into the accounts collection.

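db.temporary_null_email_collection.aggregate([
    // For each stored _id, collect the matching contact documents
    {$lookup: {
         from: "account_contacts",
         localField: "_id",
         foreignField: "account.$id", // DBRef subfield; verify this path is accepted on your server version
         as: "contacts"
    }},
    // Keep only _id and the array of contact email values
    {$project: {
          _id: 1,
          emails: "$contacts.email"
    }},
    // Merge the new emails field into the matching accounts documents
    {$merge: "accounts"}
])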
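
If the $lookup route proves hard to get working, a different approach that may still cut the run time is to read the contacts once, group the emails per account on the client, and send the updates in batches rather than one updateOne call per account. The sketch below only builds the operation list; buildEmailUpdates is a hypothetical helper, and plain strings stand in for the ObjectId values so the grouping logic can be read in isolation.

```javascript
// Hypothetical helper: turns contact documents into bulkWrite operations,
// one updateOne per account, each carrying all emails found for that account.
function buildEmailUpdates(contacts) {
    const emailsByAccount = new Map();
    for (const c of contacts) {
        const key = String(c.account.$id); // group on the string form of the id
        if (!emailsByAccount.has(key)) {
            emailsByAccount.set(key, { id: c.account.$id, emails: [] });
        }
        emailsByAccount.get(key).emails.push(c.email);
    }
    return Array.from(emailsByAccount.values(), ({ id, emails }) => ({
        updateOne: { filter: { _id: id }, update: { $set: { emails: emails } } }
    }));
}

// Plain strings stand in for ObjectId values in this illustration.
const sampleContacts = [
    { account: { $id: "AC101" }, email: "acc001@test.com" },
    { account: { $id: "AC103" }, email: "acc003@test.com" },
    { account: { $id: "AC103" }, email: "acc004@test.com" }
];
const ops = buildEmailUpdates(sampleContacts);
// In the mongo shell, the list could then be applied with something like:
// db.accounts.bulkWrite(ops, {ordered: false})
```

Unlike the original forEach script, this sketch leaves accounts that have no contacts untouched (their emails stay null instead of becoming an empty array), so a separate update would be needed if the empty array matters.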
