
Bulk processing in mongo is very slow for 1 million records

Consider the following scenario:

A CSV file is generated by a reporting tool every Friday. It contains records for all the employees in the organisation (almost 1 million employees and increasing).

This data is saved in mongo using mongoimport in the "Employee" collection.

However, the requirement is to send a "Welcome Mail" to new employees and a "Year Completion Mail" to existing employees.

To solve this, I import the new file into a temporary collection ("EmployeeTemp").

For every record in the temporary collection (EmployeeTemp), I check the old collection ("Employee") for an existing employee and mark "SendYearCompletionFlag" as true. If the record belongs to a new employee, I mark "SendWelcomeFlag" as true. Also, the project of each employee needs to be updated.

This complete process is executed via a script submitted to mongo.

The issue is that the script takes almost 18 hours to complete.

Please help me reduce the execution time of the script.

This is the script:

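// Walk the freshly imported records; noTimeout keeps the cursor open for the long run.
var list = db.employeeTemp.find().addOption(DBQuery.Option.noTimeout);
while (list.hasNext()) {
    var f = list.next();
    // Look up the same employee in the existing collection (null if not found).
    var obj = db.employee.findOne({"eid": f.eid});
    if (!obj) {
        // New employee: record the joining date and flag the welcome mail.
        f.joiningDate = new Date();
        f.sendWelcomeMail = true;
        print("Saving New record : " + f.eid);
        db.employee.save(f);
    } else {
        var joinDate = obj.joiningDate;
        // 31536000000 ms = 365 days
        if (new Date().getTime() - joinDate >= 31536000000) {
            print("Sending Year Completion Mail to " + obj.eid);
            obj.sendYearCompletionMail = true;
        }
        // Every existing employee gets the project from the new file.
        obj.project = f.project;
        print("Saving Existing record : " + obj.eid);
        db.employee.save(obj);
    }
}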

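I suggest creating an index on employee.eid if there isn't one already. The script runs one find on eid for every record in employeeTemp, and without an index each of those lookups has to scan the whole employee collection.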
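
A minimal sketch of creating that index, assuming the script runs in the mongo shell where `db` is the current database handle. The helper name `ensureEidIndex` is made up for illustration; `createIndex` is the shell's collection method (older shells call it `ensureIndex`).

```javascript
// Sketch only: `db` is expected to be the mongo shell's database handle.
// ensureEidIndex is a hypothetical helper name, not part of the original script.
function ensureEidIndex(db) {
    // Ascending single-field index on eid, so find({"eid": ...}) can use it
    // instead of scanning the whole employee collection.
    return db.employee.createIndex({ "eid": 1 });
}

// In the shell, run once before the main loop:
//   ensureEidIndex(db);
// To check that a lookup uses the index (someEid is a placeholder value):
//   db.employee.find({"eid": someEid}).explain();
```

Creating the index is a one-time step; the weekly script can call it on every run, since asking for an index that already exists with the same definition does not rebuild it.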

Another thing you can try is changing the batch size of the first find by adding batchSize(500) after setting the no-timeout option:

http://docs.mongodb.org/manual/reference/method/cursor.batchSize/
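
As a rough sketch of that change, again assuming the mongo shell: the helper name `openTempCursor` is hypothetical, and 500 is just the value suggested above, so treat it as a starting point to tune.

```javascript
// Sketch only: `db` is the mongo shell's database handle and
// noTimeoutOption is expected to be DBQuery.Option.noTimeout.
// openTempCursor is a hypothetical helper name.
function openTempCursor(db, noTimeoutOption) {
    return db.employeeTemp.find()
        .addOption(noTimeoutOption) // keep the cursor alive during the long run
        .batchSize(500);            // documents fetched per round trip to the server
}

// In the shell, this replaces the first line of the script:
//   var list = openTempCursor(db, DBQuery.Option.noTimeout);
```

The batch size only affects how the employeeTemp cursor is read; the per-record lookup and save against employee still happen one at a time.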
