
NodeJS - Out of Memory: Kill Process error on large data processing

I have a couple of .csv files that I need to compare against another large .csv file (over 300,000 rows), and I am running into an Out of Memory error on my server. I am running this on a server with 4GB of RAM, so I am not sure why this is happening, but my code looks like this. I am using ya-csv to read in the csv lines:

var csv = require('ya-csv');
var fs = require('graceful-fs');
var async = require('async');


var first_silo = [];
var second_Silo = [];
var combined = [];

var reader = csv.createCsvFileReader('december_raw.csv', {columnsFromHeader:true,'separator': ','});
var first = csv.createCsvFileReader('first_data.csv', {columnsFromHeader:false,'separator': ','});
var second = csv.createCsvFileReader('second_data.csv', {columnsFromHeader:false,'separator': ','})


async.series([
 //push data from other .csv files into arrays
function(callback){
   first.addListener('data', function(data){
      first_silo.push(data[0]);
   })
   first.addListener('end', function(){
      callback();
   })
},

function(callback){
   second.addListener('data', function(data){
       second_silo.push(data[0]);
   });
   second.addListener('end', function(data){
       callback();
   });
},

function(callback){
    reader.addListener('data', function(data){
       //compare the data from reader to each item in the first array and append the items that get a match to a .csv.
       for(var i=0;i<first_silo.length;i++){
           if(data[0] === first_silo[i]){
               fs.appendFileSync('results.csv', data[0]+","+first_silo[i])
               break;
           }
       } 
    });
},

function(callback){
    reader.addListener('data', function(data){
        //do the same with the first array as the second.
        for(var i=0;i<second_silo.length;i++){
            if(data[0] === second_silo[i]){
               fs.appendFileSync('results.csv', data[0]+","+second_silo[i]);
               break;
            }
        }
    })
}
])

When I do this I don't get past the first_silo comparison. The node app will just stop, and I can see an out of memory error when I run dmesg.

I have tried to run this program with this flag as well:

--max-old-space-size=3000
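
For reference, I'm passing the flag to node itself when launching the script (app.js here is just a stand-in for my actual file name):

node --max-old-space-size=3000 app.js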

I still get the same error.

Is there a smarter way to do this? Any help would be greatly appreciated.

Your algorithm is running pretty inefficiently for a few reasons. Please forgive me, but I'm going to do this without using the async.series call you're using. Hopefully it will still be useful.

First things first: I'm making an assumption. I'm assuming that your first file, december_raw.csv, is smaller than your second and third files. Even if this isn't the case, the approach should still work without running out of memory, as long as the file's contents don't exceed your memory limit.

Second, you're loading up two arrays at the same time instead of one at a time. This basically doubles your memory usage.

Third, my hunch is that when you call csv.createCsvFileReader, you're beginning the stream on all of the files at the same time. You likely don't want this.

Because you're comparing two files against the contents of december_raw.csv, it might be better to load that file's contents into memory completely, and then stream-compare the other two files against it in series, using a callBack and a universal comparison function.

var csv = require('ya-csv');
var fs = require('graceful-fs');

var reader_silo = []; // a variable that holds the rows of the main csv.

var reader = csv.createCsvFileReader('december_raw.csv', {columnsFromHeader:true,'separator': ','});
reader.addListener('data', function(data){
  reader_silo.push(data[0]); // load each read in row into the array
});

reader.addListener('end', function(){
  //start comparing with first csv file.
  compareRows('first_data.csv', function(){
    // compare with second data
    compareRows('second_data.csv');
  });
});

// the comparison function, takes in the filename, and a callBack if there is one.    
function compareRows(csvFileName, callBack){

  var csvStream = csv.createCsvFileReader(csvFileName, {columnsFromHeader:false,'separator': ','}); // begin stream

  csvStream.addListener('data', function(data){
    for (var i = 0; i < reader_silo.length; i++) {
      if(data[0] === reader_silo[i]){
        fs.appendFileSync('results.csv', data[0]+","+reader_silo[i]+"\n"); // add a newline so each match gets its own row
        break;
      }
    }
  });

  csvStream.addListener('end', function(data){
    // if there's a callBack then we can execute it.
    // in this case the first time it is executed there is a callBack which executes this function again with the next file.
    if(callBack && typeof callBack === "function") callBack();
  });
}

PS. If your script continues beyond this point, you might also want to consider zeroing out reader_silo when you're done with your comparisons. Your 'end' listener callBack would then look like this:

reader.addListener('end', function(){
  compareRows('first_data.csv', function(){
    compareRows('second_data.csv', function(){
      reader_silo = [];
    });
  });
});

Here's an even more memory-efficient answer, without any assumptions. In it, you make sure you pass the smallest CSV file as the first argument to a compareRows function.

This really makes sure you're being as memory-efficient as possible, by keeping only the smallest possible set stored in memory.

var csv = require('ya-csv');
var fs = require('graceful-fs');

var smallFileName = ""; // used to see if we need to really reload the file again.
var smaller_silo = [];

compareRows('smaller.csv', 'larger.csv', function(){
  compareRows('smaller.csv', 'anotherLarger.csv', function(){
    smaller_silo = []; // done, release the memory
  });
});

function compareRows(smallerFileName, largerFileName, callBack){

  var reader;
  if(smallerFileName !== smallFileName){
    smallFileName = smallerFileName;
    reader = csv.createCsvFileReader(smallerFileName, { columnsFromHeader: true, separator: ','});
    reader.addListener('data', function(data){
      smaller_silo.push(data[0]);
    });

    reader.addListener('end', function(){
      compareSmallerToLarger(largerFileName, callBack);
    });
  }
  else{
    compareSmallerToLarger(largerFileName, callBack);
  }
}

function compareSmallerToLarger(largerFileName, callBack){

  var csvStream = csv.createCsvFileReader( largerFileName, { columnsFromHeader: false, 'separator':','});
  csvStream.addListener('data', function(data){
    for (var i = 0; i < smaller_silo.length; i++) {
      if(data[0] === smaller_silo[i]){
        fs.appendFileSync('results.csv', data[0]+","+smaller_silo[i]+"\n"); // newline so each match gets its own row
        break;
      }
    }
  });
  csvStream.addListener('end', function(data){
    if(callBack && typeof callBack === "function") callBack();
  });
}
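
One more tweak worth mentioning: if you only need exact string matches on the first column, a JavaScript Set gives constant-time lookups, so each row of the larger file doesn't have to scan the whole smaller_silo array. A rough sketch of that variation (the file names are placeholders, and it assumes the same ya-csv usage as above):

var csv = require('ya-csv');
var fs = require('graceful-fs');

var smallerSet = new Set(); // holds the first column of the smaller file

var reader = csv.createCsvFileReader('smaller.csv', { columnsFromHeader: true, separator: ',' });
reader.addListener('data', function(data){
  smallerSet.add(data[0]); // O(1) insert
});

reader.addListener('end', function(){
  var csvStream = csv.createCsvFileReader('larger.csv', { columnsFromHeader: false, separator: ',' });
  csvStream.addListener('data', function(data){
    // Set#has is O(1), versus scanning the whole array for every row.
    if (smallerSet.has(data[0])) {
      // the matched value equals data[0], so this mirrors the original output format
      fs.appendFileSync('results.csv', data[0] + "," + data[0] + "\n");
    }
  });
});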

Anyway, I shouldn't obsess over things...
