NodeJS, promises, streams - processing large CSV files

I need to build a function for processing large CSV files for use in a bluebird.map() call. Given the potential sizes of the file, I'd like to use streaming.

This function should accept a stream (a CSV file) and a function (that processes the chunks from the stream), and return a promise that resolves when the file has been read to the end, or rejects on error.

So, I start with:

'use strict';

var _ = require('lodash');
var promise = require('bluebird');
var csv = require('csv');
var stream = require('stream');

var pgp = require('pg-promise')({promiseLib: promise});

api.parsers.processCsvStream = function(passedStream, processor) {

  var parser = csv.parse({trim: true});
  passedStream.pipe(parser);

  // use readable or data event?
  parser.on('readable', function() {
    // call processor, which may be async
    // how do I throttle the amount of promises generated
  });

  var db = pgp(api.config.mailroom.fileMakerDbConfig);

  return new Promise(function(resolve, reject) {
    parser.on('end', resolve);
    parser.on('error', reject);
  });

}

Now, I have two inter-related issues:

  1. I need to throttle the actual amount of data being processed, so as not to create memory pressure.
  2. The function passed as the processor param is often going to be async, such as saving the contents of the file to the db via a promise-based library (right now: pg-promise). As such, it will create a promise in memory and move on, repeatedly.

The pg-promise library has functions to manage this, like page(), but I'm not able to wrap my head around how to mix stream event handlers with these promise methods. Right now, I return a promise in the readable handler after each read(), which means I create a huge amount of promised database operations and eventually fault out because I hit a process memory limit.
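For illustration, here is a minimal sketch of the unthrottled pattern described above (the table name and column layout in the insert are made up for the example); every readable event kicks off inserts without ever waiting on them, so pending promises pile up:

// Anti-pattern: each row immediately starts an insert, and nothing
// limits how many of these promises are in flight at once.
parser.on('readable', function() {
  var record;
  while ((record = parser.read()) !== null) {
    // hypothetical insert; the returned promise is never awaited
    db.none('insert into records values($1, $2)', [record[0], record[1]]);
  }
});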

Does anyone have a working example of this that I can use as a jumping-off point?

UPDATE: Probably more than one way to skin the cat, but this works:

'use strict';

var _ = require('lodash');
var promise = require('bluebird');
var csv = require('csv');
var stream = require('stream');

var pgp = require('pg-promise')({promiseLib: promise});

api.parsers.processCsvStream = function(passedStream, processor) {

  // some checks trimmed out for example

  var db = pgp(api.config.mailroom.fileMakerDbConfig);
  var parser = csv.parse({trim: true});
  passedStream.pipe(parser);

  var readDataFromStream = function(index, data, delay) {
    var records = [];
    var record;
    do {
      record = parser.read();
      if(record != null)
        records.push(record);
    } while(record != null && (records.length < api.config.mailroom.fileParserConcurrency))
    parser.pause();

    if(records.length)
      return records;
  };

  var processData = function(index, data, delay) {
    console.log('processData(' + index + ') > data: ', data);
    parser.resume();
  };

  parser.on('readable', function() {
    db.task(function(tsk) {
      return this.page(readDataFromStream, processData);
    });
  });

  return new Promise(function(resolve, reject) {
    parser.on('end', resolve);
    parser.on('error', reject);
  });
}

Does anyone see a potential problem with this approach?

You might want to look at promise-streams:

var ps = require('promise-streams');
passedStream
  .pipe(csv.parse({trim: true}))
  .pipe(ps.map({concurrent: 4}, row => processRowDataWhichMightBeAsyncAndReturnPromise(row)))
  .wait().then(_ => {
    console.log("All done!");
  });

Works with backpressure and everything.
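Fitted into the shape the question asks for, a sketch might look like the following. It assumes the promise from .wait() rejects if the map stream errors; note that pipe() does not forward errors from the parser itself, so attach a handler for those separately if the returned promise should also reject on parse errors:

var ps = require('promise-streams');
var csv = require('csv');

// Sketch: processor is called once per parsed row and may return a promise;
// at most 4 rows are processed concurrently, with backpressure upstream.
api.parsers.processCsvStream = function(passedStream, processor) {
  return passedStream
    .pipe(csv.parse({trim: true}))
    .pipe(ps.map({concurrent: 4}, function(row) {
      return processor(row);
    }))
    .wait();
};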

Find below a complete application that correctly executes the same kind of task as you want: it reads a file as a stream, parses it as CSV, and inserts each row into the database.

const fs = require('fs');
const promise = require('bluebird');
const csv = require('csv-parse');
const pgp = require('pg-promise')({promiseLib: promise});

const cn = "postgres://postgres:password@localhost:5432/test_db";
const rs = fs.createReadStream('primes.csv');

const db = pgp(cn);

function receiver(_, data) {
    function source(index) {
        if (index < data.length) {
            // here we insert just the first column value that contains a prime number;
            return this.none('insert into primes values($1)', data[index][0]);
        }
    }

    return this.sequence(source);
}

db.task(t => {
    return pgp.spex.stream.read.call(t, rs.pipe(csv()), receiver);
})
    .then(data => {
        console.log('DATA:', data);
    })
    .catch(error => {
        console.log('ERROR:', error);
    });

Note that the only thing I changed: using the csv-parse library instead of csv, as a better alternative.

Added use of method stream.read from the spex library, which properly serves a Readable stream for use with promises.

I found a slightly better way of doing the same thing, with more control. This is a minimal skeleton with precise parallelism control. With a parallel value of 1, all records are processed in sequence without holding the entire file in memory; we can increase the parallel value for faster processing.

const csv = require('csv');
const csvParser = require('csv-parser');
const fs = require('fs');

const readStream = fs.createReadStream('IN');
const writeStream = fs.createWriteStream('OUT');

const transform = csv.transform({ parallel: 1 }, (record, done) => {
  asyncTask(...) // return Promise
    .then(result => {
      // ... do something when success
      return done(null, record);
    }, (err) => {
      // ... do something when error
      return done(null, record);
    });
});

readStream
  .pipe(csvParser())
  .pipe(transform)
  .pipe(csv.stringify())
  .pipe(writeStream);

This allows doing an async task for each record.

To return a promise instead, we can return an empty promise and complete it when the stream finishes.

    .on('end',function() {
      //do something with csvData
      console.log(csvData);
    });
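A minimal sketch of that idea, reusing the stream names from the skeleton above and resolving once the writable end has flushed (the error handler here only covers the last stream in the chain; earlier stages would need their own 'error' handlers):

function processCsvFile() {
  return new Promise((resolve, reject) => {
    readStream
      .pipe(csvParser())
      .pipe(transform)
      .pipe(csv.stringify())
      .pipe(writeStream)
      .on('finish', resolve) // writeStream has flushed everything
      .on('error', reject);  // errors emitted by writeStream only
  });
}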

So to say you don't want streaming, but some kind of data chunks? ;-)

Do you know https://github.com/substack/stream-handbook?

I think the simplest approach without changing your architecture would be some kind of promise pool, e.g. https://github.com/timdp/es6-promise-pool
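A minimal sketch of that idea, assuming the CSV rows have already been collected into an array (which trades away streaming for simplicity) and that saveRow is a stand-in for whatever promise-returning work you do per row:

const PromisePool = require('es6-promise-pool');

function saveAllRows(rows) {
  let i = 0;
  // The producer hands the pool the next pending promise, or null when done.
  const producer = () => {
    if (i < rows.length) {
      return saveRow(rows[i++]); // hypothetical async save returning a promise
    }
    return null;
  };
  // At most 4 saves are in flight at any given time.
  const pool = new PromisePool(producer, 4);
  return pool.start(); // resolves once every produced promise has settled
}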
