Saving a very large CSV to MongoDB using mongoose

Question

I have a CSV file containing more than 200,000 rows. I need to save it to MongoDB.

If I try a for loop, Node runs out of memory.

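fs.readFile('data.txt', 'utf8', function(err, data) {
  if (err) throw err;

  // The whole file is held in memory here, then split into lines
  var lines = data.split('\n');

  for (var i = 0; i < lines.length; i += 1) {
    var row = lines[i].split(',');

    var obj = { /* The object to save */ };

    var entry = new Entry(obj);
    // Every save is queued at once; nothing waits for the previous write
    entry.save(function(err) {
      if (err) throw err;
    });
  }
});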

How can I avoid running out of memory?

Answer 1

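Welcome to streaming. What you really want is an "evented stream" that processes your input "one chunk at a time", ideally split on a common delimiter such as the "newline" character you are currently using.

To make this really efficient, you can add MongoDB "Bulk API" inserts so that loading is as fast as possible without eating up all of the machine's memory or CPU cycles.

I am not advocating any one package, since there are various solutions available, but here is a listing that uses the line-input-stream package to keep the "line terminator" part simple.

The schema definitions are by "example" only: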

var LineInputStream = require("line-input-stream"),
    fs = require("fs"),
    async = require("async"),
    mongoose = require("mongoose"),
    Schema = mongoose.Schema;

var entrySchema = new Schema({},{ strict: false })

var Entry = mongoose.model( "Entry", entrySchema );

var stream = LineInputStream(fs.createReadStream("data.txt",{ flags: "r" }));

stream.setDelimiter("\n");

mongoose.connection.on("open",function(err,conn) { 

    // lower level method, needs connection
    var bulk = Entry.collection.initializeOrderedBulkOp();
    var counter = 0;

    stream.on("error",function(err) {
        console.log(err); // or otherwise deal with it
    });

    stream.on("line",function(line) {

        async.series(
            [
                function(callback) {
                    var row = line.split(",");     // split the lines on delimiter
                    var obj = {};             
                    // other manipulation

                    bulk.insert(obj);  // Bulk is okay if you don't need schema
                                       // defaults. Or can just set them.

                    counter++;

                    if ( counter % 1000 == 0 ) {
                        stream.pause();
                        bulk.execute(function(err,result) {
                            if (err) return callback(err); // return so callback is not invoked twice
                            // possibly do something with result
                            bulk = Entry.collection.initializeOrderedBulkOp();
                            stream.resume();
                            callback();
                        });
                    } else {
                        callback();
                    }
               }
           ],
           function (err) {
               // each iteration is done
           }
       );

    });

    stream.on("end",function() {

        if ( counter % 1000 != 0 )
            bulk.execute(function(err,result) {
                if (err) throw err;   // or something
                // maybe look at result
            });
    });

});

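So, generally, the "stream" interface "breaks the input down" so that it is processed "one line at a time". This stops you from loading everything at once.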

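The main part is the "Bulk Operations API" from MongoDB. It allows you to "queue up" many operations before actually sending them to the server. In this case the "modulo" check means a write is sent only once per 1000 entries processed. You can queue anything up to the 16MB BSON limit, but keep it manageable.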
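The queue-then-flush pattern described above can be sketched without a database. In this minimal illustration, createBatcher and its flush callback are hypothetical names that are not part of MongoDB or mongoose; the callback stands in for bulk.execute(), and the sketch only shows when a batch would be sent.

```javascript
// Hypothetical helper: collects items and hands them to `flush` in groups of `size`.
function createBatcher(size, flush) {
  var pending = [];
  return {
    add: function (item) {
      pending.push(item);
      if (pending.length === size) {
        flush(pending);   // a full batch is ready, send it
        pending = [];     // start a fresh batch
      }
    },
    // Call once at the end to send whatever is left over.
    end: function () {
      if (pending.length > 0) {
        flush(pending);
        pending = [];
      }
    }
  };
}

// Record the size of every batch instead of talking to a database.
var batchSizes = [];
var batcher = createBatcher(1000, function (batch) {
  batchSizes.push(batch.length);
});

for (var i = 0; i < 2500; i++) {
  batcher.add({ n: i });
}
batcher.end();

console.log(batchSizes); // [ 1000, 1000, 500 ]
```

With 2500 entries and a batch size of 1000, two full batches are flushed during the loop and the remaining 500 are flushed by the final end() call, which mirrors the "end" handler in the listing.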

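Besides processing the operations in bulk, there is an additional "limiter" in place from the async library. It is not strictly required, but it ensures that essentially no more than the "modulo limit" of documents is in process at any time. The batched "inserts" themselves have no IO cost other than memory, but the "execute" calls mean IO is in progress. So we wait rather than queue up more work.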

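There are surely better solutions to be found for "stream processing" CSV-type data, which this appears to be. But in general this gives you the concept of how to do it in a memory-efficient way without eating up CPU cycles as well.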
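One reason a dedicated CSV parser may be preferable is that line.split(",") breaks as soon as a field contains a quoted comma. The following is only a rough sketch of a quote-aware splitter for a single line; splitCsvLine is a hypothetical helper, and it assumes double-quoted fields with "" as the escape for a literal quote. It does not handle line breaks inside quoted fields.

```javascript
// Hypothetical helper: split one CSV line, honouring double-quoted fields.
function splitCsvLine(line) {
  var fields = [];
  var current = "";
  var inQuotes = false;

  for (var i = 0; i < line.length; i++) {
    var ch = line[i];
    if (inQuotes) {
      if (ch === '"' && line[i + 1] === '"') {
        current += '"';   // "" inside quotes is an escaped quote
        i++;
      } else if (ch === '"') {
        inQuotes = false; // closing quote
      } else {
        current += ch;
      }
    } else if (ch === '"') {
      inQuotes = true;    // opening quote
    } else if (ch === ",") {
      fields.push(current);
      current = "";
    } else {
      current += ch;
    }
  }
  fields.push(current);   // last field has no trailing comma
  return fields;
}

var naive = '1,"Smith, John",42'.split(",");
var aware = splitCsvLine('1,"Smith, John",42');

console.log(naive.length); // 4
console.log(aware);        // [ '1', 'Smith, John', '42' ]
```

The naive split yields four pieces because the comma inside the quotes is treated as a delimiter, while the quote-aware version returns the three intended fields.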

Answer 2

The accepted answer is good and tries to cover all the important aspects of this problem:

Reading the CSV file as a stream of lines
Writing the documents to MongoDB in batches
Synchronization between reading and writing

While it does well on the first two aspects, the approach of using async.series() to handle the synchronization will not work as expected.

stream.on("line",function(line) {
    async.series(
        [
            function(callback) {
                var row = line.split(",");     // split the lines on delimiter
                var obj = {};             
                // other manipulation

                bulk.insert(obj);  // Bulk is okay if you don't need schema
                                   // defaults. Or can just set them.

                counter++;

                if ( counter % 1000 == 0 ) {
                    bulk.execute(function(err,result) {
                        if (err) throw err;   // or do something
                        // possibly do something with result
                        bulk = Entry.collection.initializeOrderedBulkOp();
                        callback();
                    });
                } else {
                    callback();
                }
           }
       ],
       function (err) {
           // each iteration is done
       }
   );
});

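Here bulk.execute() is a MongoDB write operation, and it is an asynchronous IO call. This allows node.js to carry on with the event loop before bulk.execute() has finished its db writes and called back.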

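So it may go on to receive more "line" events from the stream, queue more documents with bulk.insert(obj), and hit the next modulo, triggering bulk.execute() again before the previous batch has completed.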

Let's have a look at this example.

var async = require('async');

var bulk = {
    execute: function(callback) {
        setTimeout(callback, 1000);
    }
};

async.series(
    [
       function (callback) {
           bulk.execute(function() {
              console.log('completed bulk.execute');
              callback(); 
           });
       },
    ], 
    function(err) {

    }
);

console.log("!!! proceeding to read more from stream");

Here is the output:

!!! proceeding to read more from stream
completed bulk.execute

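To ensure that we are processing only one batch of N documents at any given time, we need to enforce flow control on the file stream using stream.pause() and stream.resume().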

var LineInputStream = require("line-input-stream"),
    fs = require("fs"),
    mongoose = require("mongoose"),
    Schema = mongoose.Schema;

var entrySchema = new Schema({},{ strict: false });
var Entry = mongoose.model( "Entry", entrySchema );

var stream = LineInputStream(fs.createReadStream("data.txt",{ flags: "r" }));

stream.setDelimiter("\n");

mongoose.connection.on("open",function(err,conn) { 

    // lower level method, needs connection
    var bulk = Entry.collection.initializeOrderedBulkOp();
    var counter = 0;

    stream.on("error",function(err) {
        console.log(err); // or otherwise deal with it
    });

    stream.on("line",function(line) {
        var row = line.split(",");     // split the lines on delimiter
        var obj = {};             
        // other manipulation

        bulk.insert(obj);  // Bulk is okay if you don't need schema
                           // defaults. Or can just set them.

        counter++;

        if ( counter % 1000 === 0 ) {
            stream.pause(); //lets stop reading from file until we finish writing this batch to db

            bulk.execute(function(err,result) {
                if (err) throw err;   // or do something
                // possibly do something with result
                bulk = Entry.collection.initializeOrderedBulkOp();

                stream.resume(); //continue to read from file
            });
        }
    });

    stream.on("end",function() {
        if ( counter % 1000 != 0 ) {
            bulk.execute(function(err,result) {
                if (err) throw err;   // or something
                // maybe look at result
            });
        }
    });

});
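On newer Node versions the same read-then-wait flow control can also be written with async iteration, where awaiting each batch write keeps the loop from pulling more lines until the write has finished. This is a minimal sketch under assumptions, not part of either answer: importLines, writeBatch and fakeLines are hypothetical names, the line source is taken to be any async iterable of strings, and the generator below stands in for a real file.

```javascript
// Sketch only: `lines` is any async iterable of strings and `writeBatch`
// is a hypothetical function that returns a promise when the batch is stored.
async function importLines(lines, batchSize, writeBatch) {
  let batch = [];
  let total = 0;

  for await (const line of lines) {
    if (line.trim() === "") continue;        // skip blank lines
    batch.push({ fields: line.split(",") }); // naive split, as in the answers

    if (batch.length === batchSize) {
      await writeBatch(batch);               // no more lines are read until this resolves
      total += batch.length;
      batch = [];
    }
  }

  if (batch.length > 0) {                    // flush the final partial batch
    await writeBatch(batch);
    total += batch.length;
  }
  return total;
}

// Stand-in line source so the sketch runs without a file or a database.
async function* fakeLines(n) {
  for (let i = 0; i < n; i++) yield "a" + i + ",b" + i;
}

const written = [];
const done = importLines(fakeLines(5), 2, async function (batch) {
  written.push(batch.length);
});

done.then(function (total) {
  console.log(total, written); // 5 [ 2, 2, 1 ]
});
```

With a real file, the line source could be readline.createInterface({ input: fs.createReadStream("data.txt"), crlfDelay: Infinity }) and writeBatch could wrap something like Entry.insertMany(batch); both are suggestions to verify against your own Node and mongoose versions rather than drop-in code.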
