
Save a very big CSV to mongoDB using mongoose

I have a CSV file containing more than 200'000 rows. I need to save it to MongoDB.

If I try a for loop, Node will run out of memory.

fs.readFile('data.txt', 'utf8', function(err, data) {
  if (err) throw err;

  var lines = data.split('\n'); // the whole file is already in memory at this point

  for (var i = 0; i < lines.length; i += 1) {
    var row = lines[i].split(',');

    var obj = { /* The object to save */ };

    var entry = new Entry(obj);
    entry.save(function(err) {
      if (err) throw err;
    });
  }
});

How can I avoid running out of memory?

Welcome to streaming. What you really want is an "evented stream" that processes your input "one chunk at a time", and of course ideally by a common delimiter such as the "newline" character you are currently using.
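
To make the "evented stream" idea concrete before the full listing, here is a minimal sketch using Node's built-in readline module. It is only an illustration; the listing further down uses the line-input-stream package instead:

var fs = require("fs"),
    readline = require("readline");

var rl = readline.createInterface({
    input: fs.createReadStream("data.txt")
});

rl.on("line", function(line) {
    // one line arrives at a time; the whole file is never held in memory
    console.log("got a line of length", line.length);
});

rl.on("close", function() {
    console.log("done reading");
});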

For really efficient loading, you can add MongoDB "Bulk API" inserts to make it as fast as possible without eating up all of the machine's memory or CPU cycles.

Not advocating this over the various other solutions available, but here is a listing that uses the line-input-stream package to keep the "line terminator" part simple.

Schema definitions by "example" only:

var LineInputStream = require("line-input-stream"),
    fs = require("fs"),
    async = require("async"),
    mongoose = require("mongoose"),
    Schema = mongoose.Schema;

var entrySchema = new Schema({},{ strict: false });

var Entry = mongoose.model( "Entry", entrySchema );

var stream = LineInputStream(fs.createReadStream("data.txt",{ flags: "r" }));

stream.setDelimiter("\n");

mongoose.connect("mongodb://localhost/test"); // example URI; open the connection however you normally do

mongoose.connection.on("open",function(err,conn) { 

    // lower level method, needs connection
    var bulk = Entry.collection.initializeOrderedBulkOp();
    var counter = 0;

    stream.on("error",function(err) {
        console.log(err); // or otherwise deal with it
    });

    stream.on("line",function(line) {

        async.series(
            [
                function(callback) {
                    var row = line.split(",");     // split the lines on delimiter
                    var obj = {};             
                    // other manipulation

                    bulk.insert(obj);  // Bulk is okay if you don't need schema
                                       // defaults. Or can just set them.

                    counter++;

                    if ( counter % 1000 == 0 ) {
                        stream.pause();
                        bulk.execute(function(err,result) {
                            if (err) return callback(err);
                            // possibly do something with result
                            bulk = Entry.collection.initializeOrderedBulkOp();
                            stream.resume();
                            callback();
                        });
                    } else {
                        callback();
                    }
               }
           ],
           function (err) {
               // each iteration is done
           }
       );

    });

    stream.on("end",function() {

        if ( counter % 1000 != 0 )
            bulk.execute(function(err,result) {
                if (err) throw err;   // or something
                // maybe look at result
            });
    });

});

So generally the "stream" interface there "breaks the input down" in order to process "one line at a time". That stops you from loading everything at once.

The main parts are the "Bulk Operations API" from MongoDB. This allows you to "queue up" many operations at a time before actually sending them to the server. So in this case, with the use of a "modulo", writes are only sent per 1000 entries processed. You can really do anything up to the 16MB BSON limit, but keep it manageable.
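
The "queue up, then send" idea in isolation looks roughly like this (a sketch only; it assumes an already open connection and the same Entry model as in the listing above):

var bulk = Entry.collection.initializeOrderedBulkOp();

bulk.insert({ a: 1 });   // nothing is sent yet, operations just queue up locally
bulk.insert({ a: 2 });
bulk.insert({ a: 3 });

bulk.execute(function(err, result) {   // a single round trip sends the whole batch
    if (err) throw err;
    console.log(result.nInserted);     // how many documents were inserted
});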

In addition to the operations being processed in bulk, there is an additional "limiter" in place from the async library. It's not really required, but this ensures that essentially no more than the "modulo limit" of documents are in process at any time. The general batch "inserts" come at no IO cost other than memory, but the "execute" calls mean IO is processing. So we wait rather than queuing up more things.

There are surely better solutions you can find for "stream processing" CSV-type data, which this appears to be. But in general this gives you the concepts for how to do this in a memory-efficient manner without eating CPU cycles as well.
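
For example, a dedicated streaming CSV parser can replace the manual line splitting. A rough sketch, assuming the csv-parse package, reusing the Entry model from above, and a Mongoose version where insertMany still accepts a callback:

var fs = require("fs"),
    parse = require("csv-parse");   // v4-style import; csv-parse v5 exports { parse }

var parser = fs.createReadStream("data.txt").pipe(parse({ columns: true }));
var buffer = [];

parser.on("data", function(record) {
    buffer.push(record);            // record is already an object keyed by the CSV header
    if (buffer.length >= 1000) {
        var batch = buffer;
        buffer = [];
        parser.pause();             // stop parsing while this batch is written
        Entry.insertMany(batch, function(err) {
            if (err) throw err;
            parser.resume();
        });
    }
});

parser.on("end", function() {
    if (buffer.length) {
        Entry.insertMany(buffer, function(err) {
            if (err) throw err;
        });
    }
});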

The accepted answer is great and attempted to cover all the important aspects of this problem.

  1. Reading the CSV file as a stream of lines
  2. Writing the documents in batches to MongoDB
  3. Synchronization between reading and writing

While it did well with the first two aspects, the approach taken to address the synchronization using async.series() won't work as expected.

stream.on("line",function(line) {
    async.series(
        [
            function(callback) {
                var row = line.split(",");     // split the lines on delimiter
                var obj = {};             
                // other manipulation

                bulk.insert(obj);  // Bulk is okay if you don't need schema
                                   // defaults. Or can just set them.

                counter++;

                if ( counter % 1000 == 0 ) {
                    bulk.execute(function(err,result) {
                        if (err) throw err;   // or do something
                        // possibly do something with result
                        bulk = Entry.collection.initializeOrderedBulkOp();
                        callback();
                    });
                } else {
                    callback();
                }
           }
       ],
       function (err) {
           // each iteration is done
       }
   );
});

Here bulk.execute() is a MongoDB write operation and it's an asynchronous IO call. This allows node.js to proceed with the event loop before bulk.execute() is done with its db writes and calls back.

So it may go on to receive more 'line' events from the stream, queue more documents with bulk.insert(obj), and hit the next modulo to trigger bulk.execute() again.

Let's have a look at this example.

var async = require('async');

var bulk = {
    execute: function(callback) {
        setTimeout(callback, 1000);
    }
};

async.series(
    [
       function (callback) {
           bulk.execute(function() {
              console.log('completed bulk.execute');
              callback(); 
           });
       },
    ], 
    function(err) {

    }
);

console.log("!!! proceeding to read more from stream");

Its output:

!!! proceeding to read more from stream
completed bulk.execute

To really ensure that only one batch of N documents is being processed at any given time, we need to enforce flow control on the file stream using stream.pause() and stream.resume():

var LineInputStream = require("line-input-stream"),
    fs = require("fs"),
    mongoose = require("mongoose"),
    Schema = mongoose.Schema;

var entrySchema = new Schema({},{ strict: false });
var Entry = mongoose.model( "Entry", entrySchema );

var stream = LineInputStream(fs.createReadStream("data.txt",{ flags: "r" }));

stream.setDelimiter("\n");

mongoose.connect("mongodb://localhost/test"); // example URI; open the connection however you normally do

mongoose.connection.on("open",function(err,conn) { 

    // lower level method, needs connection
    var bulk = Entry.collection.initializeOrderedBulkOp();
    var counter = 0;

    stream.on("error",function(err) {
        console.log(err); // or otherwise deal with it
    });

    stream.on("line",function(line) {
        var row = line.split(",");     // split the lines on delimiter
        var obj = {};             
        // other manipulation

        bulk.insert(obj);  // Bulk is okay if you don't need schema
                           // defaults. Or can just set them.

        counter++;

        if ( counter % 1000 === 0 ) {
            stream.pause(); //lets stop reading from file until we finish writing this batch to db

            bulk.execute(function(err,result) {
                if (err) throw err;   // or do something
                // possibly do something with result
                bulk = Entry.collection.initializeOrderedBulkOp();

                stream.resume(); //continue to read from file
            });
        }
    });

    stream.on("end",function() {
        if ( counter % 1000 != 0 ) {
            bulk.execute(function(err,result) {
                if (err) throw err;   // or something
                // maybe look at result
            });
        }
    });

});
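
For completeness, on newer Node versions the same flow control falls out of async iteration, because for await...of stops pulling new lines while a batch is being written. A sketch only, assuming Node 12+ with the built-in readline module and a recent Mongoose (the connection string is just an example):

const fs = require("fs");
const readline = require("readline");
const mongoose = require("mongoose");

const Entry = mongoose.model("Entry", new mongoose.Schema({}, { strict: false }));

async function load() {
    await mongoose.connect("mongodb://localhost/test");   // example connection string
    const rl = readline.createInterface({ input: fs.createReadStream("data.txt") });

    let batch = [];
    for await (const line of rl) {          // no new lines are read while we await below
        const row = line.split(",");
        batch.push({ /* build the document from row */ });
        if (batch.length === 1000) {
            await Entry.insertMany(batch);  // one round trip per 1000 documents
            batch = [];
        }
    }
    if (batch.length) await Entry.insertMany(batch);       // flush the final partial batch
    await mongoose.disconnect();
}

load().catch(console.error);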
