Bulk processing in mongo is very slow for 1 million records

Consider the following scenario:

A CSV file is generated by a reporting tool every Friday. It contains records for all employees in the organisation (almost 1 million, and growing).

This data is loaded into MongoDB with mongoimport, into the "Employee" collection.

However, the requirement is to send a "Welcome Mail" to new employees and a "Year Completion Mail" to existing employees.

To solve this, I am importing the new file into a temporary collection ("EmployeeTemp").

For every record in the temporary collection (EmployeeTemp), I look up the employee in the old collection ("Employee"). If the employee already exists and joined at least a year ago, I set "sendYearCompletionMail" to true. If the record belongs to a new employee, I set "sendWelcomeMail" to true. Each employee's project also needs to be updated.

This complete process is executed via a script submitted to mongo.

The issue is that the script takes almost 18 hours to complete.

Please help me reduce the execution time of the script.

This is the script:

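var ONE_YEAR_MS = 31536000000; // 365 days in milliseconds

// Iterate over the freshly imported records; noTimeout keeps the cursor open for the long run.
var list = db.employeeTemp.find().addOption(DBQuery.Option.noTimeout);
while (list.hasNext()) {
    var f = list.next();
    // Look up the matching record in the main collection by employee id.
    var obj = db.employee.findOne({"eid": f.eid});
    if (!obj) {
        // New employee: stamp the joining date and flag the welcome mail.
        f.joiningDate = new Date();
        f.sendWelcomeMail = true;
        print("Saving New record : " + f.eid);
        db.employee.save(f);
    } else {
        // Existing employee: flag the mail once a year has passed since joining.
        var joinDate = obj.joiningDate;
        if (new Date().getTime() - joinDate.getTime() >= ONE_YEAR_MS) {
            print("Sending Year Completion Mail to " + obj.eid);
            obj.sendYearCompletionMail = true;
        }
        // Carry over the project from the new file.
        obj.project = f.project;
        print("Saving Existing record : " + obj.eid);
        db.employee.save(obj);
    }
}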

I suggest creating an index on employee.eid. Without one, every lookup inside the loop has to scan the whole employee collection.
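A minimal sketch of the index suggestion, assuming the collection and field names from the script in the question. The helper name ensureEidIndex is made up for illustration, and createIndex is the name in current shells (older shells call it ensureIndex). The guarded part only runs inside the mongo shell, where db is defined.

```javascript
// Hypothetical helper: builds an ascending single-field index on eid
// so that find({eid: ...}) can use the index instead of scanning the collection.
function ensureEidIndex(db) {
    return db.employee.createIndex({ eid: 1 });
}

// Only runs inside the mongo shell, where `db` is a predefined global.
if (typeof db !== "undefined") {
    ensureEidIndex(db);
    // Optional check: explain a lookup for an eid that actually exists.
    var sample = db.employee.findOne();
    if (sample) {
        printjson(db.employee.find({ eid: sample.eid }).explain());
    }
}
```

The explain output should show whether the lookup now goes through the eid index. The index only needs to be built once, before the weekly script runs.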

Another thing you can try is to change the batch size of the first find by adding batchSize(500) after setting the no-timeout option:

http://docs.mongodb.org/manual/reference/method/cursor.batchSize/
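As a rough sketch of that change, assuming the legacy mongo shell (where db and DBQuery are globals): the helper name openTempCursor is hypothetical, and 500 is just the value suggested above, not a tuned number.

```javascript
// Hypothetical helper: opens the cursor over the temporary collection
// with the same no-timeout option as the question, plus an explicit batch size.
function openTempCursor(db, batchSize) {
    return db.employeeTemp.find()
        .addOption(DBQuery.Option.noTimeout) // keep the cursor from timing out
        .batchSize(batchSize);               // documents returned per batch
}

// Only runs inside the mongo shell, where `db` is a predefined global.
if (typeof db !== "undefined") {
    var list = openTempCursor(db, 500);
    // ... the while (list.hasNext()) loop from the question continues unchanged
}
```

It may be worth timing a few different batch sizes on a sample of the data, since the best value depends on document size and network latency.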
