
Can't populate big chunk of data to mongodb using Node.js

I was asked to import a big chunk of weather data collected from many sites all over the city. Each site has one computer with one folder, which is synced to a central server every 5 minutes. Every day, a new file is created. So the structure is basically like this. Each txt file is formatted like a CSV file: the first line holds the field names, and the rest of the lines hold numbers.

folder_on_server
|__ site1 __ date1.txt
|         |__ date2.txt
|
|__ site2 __ date1.txt
          |__ date2.txt
I wrote a small Node.js app to populate those data into mongoDB. However, although we currently have only 3 sites, each site has almost 900 txt files, and each file contains 24*12 = 288 rows (data is recorded every 5 minutes). I tried to run the node app, but after reading about 100 files of the first folder, the program crashes with an error about memory allocation failure.

I have tried many ways to improve this:

  1. Increase the memory size of Node.js to 8 GB => a little better; more files are read in, but it still cannot move on to the next folder.
  2. Set some variables to null and undefined at the end of the _.forEach loop (I use underscore) => does not help.
  3. Shift the files array (from fs.readdir) so that the first element gets deleted => does not help either.

Is there any way to force JS to clean up memory each time it finishes reading a file? Thanks.
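For context, here is a simplified sketch of the kind of synchronous read loop that typically hits this limit (the names and details below are placeholders, not the actual script):

var fs = require('fs');
var _ = require('underscore');

// A common shape for this kind of failure: readFileSync blocks the event loop,
// and firing inserts without waiting for their callbacks means every document
// queues up in the driver's buffers until the whole loop finishes, so heap
// usage keeps growing (even with node --max-old-space-size=8192).
function importFolderSync(folderPath, collection) {
  var files = fs.readdirSync(folderPath);
  _.forEach(files, function (file) {
    var content = fs.readFileSync(folderPath + '/' + file, 'utf8');
    var lines = content.split('\n');
    var fields = lines[0].split(',');           // first line holds the field names
    _.forEach(lines.slice(1), function (line) {
      var doc = _.object(fields, line.split(','));
      collection.insert(doc);                   // fire-and-forget insert per row
    });
  });
}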

Update 1: I ended up adding 100 files to each folder at a time. This seems tedious, but it worked, and it is a one-time job anyway. However, I still want to find a solution for this.
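One way to automate that manual batching is to split each folder's file list into chunks and import one chunk at a time. A rough sketch, assuming a hypothetical importChunk(files, callback) helper that imports a list of files and calls back when done:

var async = require('async');

// Split the file list into fixed-size chunks and import them one after another,
// mirroring the manual "100 files at a time" approach.
// importChunk(files, callback) is a hypothetical helper.
function importInChunks(files, chunkSize, importChunk, done) {
  var chunks = [];
  for (var i = 0; i < files.length; i += chunkSize) {
    chunks.push(files.slice(i, i + chunkSize));
  }
  async.eachSeries(chunks, importChunk, done);
}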

Try using streams instead of loading each whole file into memory.

I've sent you a pull request with an implementation using streams and async I/O.

This is most of it:

var Async = require('async');
var Csv = require('csv-streamify');
var Es = require('event-stream');
var Fs = require('fs');
var Mapping = require('./folder2siteRef.json');
var MongoClient = require('mongodb').MongoClient;

var sourcePath = '/hnet/incoming/' + new Date().getFullYear();

Async.auto({
  db: function (callback) {
    console.log('opening db connection');
    MongoClient.connect('mongodb://localhost:27017/test3', callback);
  },
  subDirectory: function (callback) {
    // read the list of subfolders, one per site
    Fs.readdir(sourcePath, callback);
  },
  loadData: ['db', 'subDirectory', function (callback, results) {
    Async.each(results.subDirectory, load(results.db), callback);
  }],
  cleanUp: ['db', 'loadData', function (callback, results) {
    console.log('closing db connection');
    results.db.close(callback);
  }]
}, function (err) {
  console.log(err || 'Done');
});

var load = function (db) {
  return function (directory, callback) {
    var basePath = sourcePath + '/' + directory;
    Async.waterfall([
      function (callback) {
        Fs.readdir(basePath, callback); // array of files in a directory
      },
      function (files, callback) {
        console.log('loading ' + files.length + ' files from ' + directory);
        Async.each(files, function (file, callback) {
          Fs.createReadStream(basePath + '/' + file)
            .pipe(Csv({objectMode: true, columns: true}))
            .pipe(transform(directory))
            .pipe(batch(200))
            .pipe(insert(db).on('end', callback));
        }, callback);
      }
    ], callback);
  };
};

var transform = function (directory) {
  return Es.map(function (data, callback) {
    data.siteRef = Mapping[directory];
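    // TheTime is an Excel-style serial date: 25569 is the serial for 1970-01-01,
    // so this converts it to unix seconds; the extra 6*3600 looks like a timezone offset.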
    data.epoch = parseInt((data.TheTime - 25569) * 86400) + 6 * 3600;
    callback(null, data);
  });
};

var insert = function (db) {
  return Es.map(
    function (data, callback) {
      if (data.length) {
        var bulk = db.collection('hnet').initializeUnorderedBulkOp();
        data.forEach(function (doc) {
          bulk.insert(doc);
        });
        bulk.execute(callback);
      } else {
        callback();
      }
    }
  );
};

var batch = function (batchSize) {
  batchSize = batchSize || 1000;
  var batch = [];

  return Es.through(
    function write (data) {
      batch.push(data);
      if (batch.length === batchSize) {
        this.emit('data', batch);
        batch = [];
      }
    },
    function end () {
      if (batch.length) {
        this.emit('data', batch);
        batch = [];
      }
      this.emit('end');
    }
  );
};

I've updated your tomongo.js script to use streams. I've also changed its file I/O to be async instead of sync.

I tested this against the structure defined in your code with small data sets and it worked really well. I also did some limited testing against 3 dirs of 900 files with 288 lines each. I'm not sure how big each row of your data is, so I threw a few random properties in. It's quite fast. See how it goes with your data. If it causes issues, you could try throttling it with a different write concern when executing the bulk insert operation.
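For example, a sketch of how the insert mapper above could pass an explicit write concern to each bulk execute (the options shown are only an illustration, tune them for your setup):

// Variant of insert() that passes a write concern to bulk.execute();
// { w: 1, j: true } waits for the journal on each batch, which slows the
// writes down and acts as a crude throttle.
var insertThrottled = function (db, writeConcern) {
  return Es.map(function (data, callback) {
    if (!data.length) { return callback(); }
    var bulk = db.collection('hnet').initializeUnorderedBulkOp();
    data.forEach(function (doc) {
      bulk.insert(doc);
    });
    bulk.execute(writeConcern || { w: 1, j: true }, callback);
  });
};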

Also check out some of these links for more information on streams in node.js:

http://nodestreams.com - a tool written by John Resig with many stream examples.

And event-stream, a very useful streams module.

As Robbie said, streams are the way to go with this. fs.createReadStream() should be used instead of .readFileSync(). I'd start with creating a line reader that takes a path and whatever string/regex you want to split on:

linereader.js

var fs = require("fs");
var util = require("util");
var EventEmitter = require("events").EventEmitter;

function LineReader(path, splitOn) {

    var readStream = fs.createReadStream(path);
    var self = this;
    var lineNum = 0;
    var buff = "";
    var chunk;

    readStream.on("readable", function() {

        while( (chunk = readStream.read(100)) !== null) {
            buff += chunk.toString();
            var lines = buff.split(splitOn);

            for (var i = 0; i < lines.length - 1; i++) {
                self.emit("line",lines[i]);
                lineNum += 1;
            }
            buff = lines[lines.length - 1];
        }
    });
    readStream.on("close", function() {
        self.emit("line", buff);
        self.emit("close")
    });
    readStream.on("error", function(err) {
        self.emit("error", err);
    });
}
util.inherits(LineReader, EventEmitter);
module.exports = LineReader;

This will read a text file and emit a "line" event for each line read, so you won't have all of them in memory at once. Then, using the async package (or whatever async loop you want to use), loop through the files, inserting each document:

app.js

var LineReader = require("./linereader.js");
var async = require("async");

var paths = ["./text1.txt", "./text2.txt", "./path1/text3.txt"];
var reader;

async.eachSeries(paths, function(path, callback) {

    reader = new LineReader(path, /\n/g);

    reader.on("line", function(line) {
        var doc = turnTextIntoObject(line);
        db.collection("mycollection").insert(doc);
    });
    reader.on("close", callback);
    reader.on("error", callback);
}, function(err) {
    // handle error and finish;
});
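Note that db above is assumed to be an already-open database handle; a minimal sketch of wiring that up (the connection string and database name are placeholders):

var MongoClient = require("mongodb").MongoClient;

// Open the connection once, run the loop above with this `db`,
// and close it in the final callback.
MongoClient.connect("mongodb://localhost:27017/weather", function(err, db) {
    if (err) { throw err; }

    // ... run the async.eachSeries loop from app.js here with this `db`,
    // and call db.close() when the final callback fires.
});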
