
NodeJS: Using Pipe To Write A File From A Readable Stream Gives Heap Memory Error

I am trying to create 150 million lines of data and write the data into a csv file so that I can insert the data into different databases with little modification.

I am using a few functions to generate seemingly random data and pushing it into the readable stream, which is then piped into the writable stream.

The code that I have right now fails to handle the memory issue.

After a few hours of research, I am starting to think that I should not be pushing each row at the end of the for loop, because it seems that the pipe method simply cannot handle garbage collection this way.

Also, I found a few Stack Overflow answers and Node.js docs that recommend against using push at all.

However, I am very new to Node.js; I feel blocked and do not know how to proceed from here.

If someone can provide me any guidance on how to proceed and give me an example, I would really appreciate it.

Below is a part of my code to give you a better understanding of what I am trying to achieve.

PS -

I have found a way to successfully handle the memory issue without using the pipe method at all (I used the drain event), but I had to start from scratch, and now I am curious to know whether there is a simple way to handle this memory issue without completely changing this bit of code.

Also, I have been trying to avoid using any library, because I feel like there should be a relatively easy tweak that makes this work without one, but please tell me if I am wrong. Thank you in advance.

const fs = require('fs');
const Stream = require('stream');

// lorem, randomDate and randomNumber are helper functions defined elsewhere

// This is my target number of data
const targetDataNum = 150000000;

// Create readable stream
const readableStream = new Stream.Readable({
  read() {}
});

// Create writable stream
const writableStream = fs.createWriteStream('./database/RDBMS/test.csv');

// Write columns first
writableStream.write('id, body, date, dp\n', 'utf8');

// Then, push a number of data to the readable stream (150M in this case)
for (var i = 1; i <= targetDataNum; i += 1) {
  const id = i;
  const body = lorem.paragraph(1);
  const date = randomDate(new Date(2014, 0, 1), new Date());
  const dp = randomNumber(1, 1000);
  const data = `${id},${body},${date},${dp}\n`;
  readableStream.push(data, 'utf8');
}

// Pipe readable stream to writeable stream
readableStream.pipe(writableStream);

// End the stream
readableStream.push(null);

I suggest trying a solution like the following:

const { Readable } = require('readable-stream'); // or use Node's core 'stream' module

class CustomReadable extends Readable {
  constructor(max, options = {}) {
    super(options);
    this.targetDataNum = max;
    this.i = 1;
  }

  _read(size) {
    if (this.i <= this.targetDataNum) {
      // your code to build the csv line for row this.i goes here
      const data = `${this.i},placeholder\n`; // placeholder row
      this.i += 1;
      this.push(data, 'utf8');
      return;
    }
    this.push(null);
  }
}

const rs = new CustomReadable(150000000);

rs.pipe(ws);

Just complete it with your portion of the code that fills the csv rows and creates the writable stream.
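For example, a minimal sketch of the missing writable stream, reusing the filename and header row from the question:

const fs = require('fs');

// The writable stream that the rs.pipe(ws) call above expects.
const ws = fs.createWriteStream('./database/RDBMS/test.csv');
ws.write('id, body, date, dp\n', 'utf8');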

With this solution you leave the calls to rs.push to the internal _read stream method, which keeps being invoked until this.push(null) is called. Previously you were probably filling the internal stream buffer too fast by calling push manually in a loop, which caused the out-of-memory error.
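For illustration, a minimal sketch of that backpressure signal: push returns false once the readable's internal buffer passes its highWaterMark, which is exactly what the manual loop never checked:

const { Readable } = require('stream');

// A readable with a no-op read(), like the one in the question.
const r = new Readable({ read() {} });

// push() returns false when the internal buffer is over highWaterMark.
const ok = r.push('some chunk', 'utf8');
if (!ok) {
  // Backpressure: stop pushing until read() is invoked again.
}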

Since you're new to streams, maybe start with an easier abstraction: generators. Generators produce data only when it is consumed (just like streams should), but they don't have buffering or complicated constructors and methods.

This is just your for loop, moved into a generator function:

function * generateData(targetDataNum) {
  for (var i = 1; i <= targetDataNum; i += 1) {
    const id = i;
    const body = lorem.paragraph(1);
    const date = randomDate(new Date(2014, 0, 1), new Date());
    const dp = randomNumber(1, 1000);
    yield `${id},${body},${date},${dp}\n`;
  }
}

In Node 12, you can create a Readable stream directly from any iterable, including generators and async generators:

const { Readable } = require('stream');

const stream = Readable.from(generateData(targetDataNum), { encoding: 'utf8' });
stream.pipe(writableStream);
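If you can also use the core stream.pipeline helper, it handles errors and cleanup for you; a minimal sketch, reusing generateData and writableStream from above:

const { Readable, pipeline } = require('stream');

pipeline(
  Readable.from(generateData(targetDataNum), { encoding: 'utf8' }),
  writableStream,
  (err) => {
    if (err) console.error('pipeline failed:', err);
    else console.log('done writing');
  }
);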

Try piping to the WritableStream before you start pumping data into the ReadableStream, and yield before you write the next chunk.

...

// Write columns first
writableStream.write('id, body, date, dp\n', 'utf8');

// Pipe readable stream to writeable stream
readableStream.pipe(writableStream);

// Then, push a number of data to the readable stream (150M in this case)
for (var i = 1; i <= targetDataNum; i += 1) {
  const id = i;
  const body = lorem.paragraph(1);
  const date = randomDate(new Date(2014, 0, 1), new Date());
  const dp = randomNumber(1, 1000);
  const data = `${id},${body},${date},${dp}\n`;
  readableStream.push(data, 'utf8');

  // somehow YIELD for the STREAM to drain out.

}
...

The entire Stream implementation of Node.js relies on the fact that the wire is slow and that the CPU can actually have downtime before the next chunk of data comes in from the stream source, or until the next chunk of data has been written to the stream destination.

In the current implementation, since the for loop has booked up the CPU, there is no downtime for the actual piping of the data to the write stream. You will be able to catch this if you watch cat test.csv, which will not change while the loop is running.

As (I am sure) you know, pipe helps guarantee that the data you are working with is buffered in memory only in chunks and not as a whole. But that guarantee only holds true if the CPU gets enough downtime to actually drain the data.

Having said all that, I wrapped your entire code into an async IIFE and ran it with an await on a setImmediate, which ensures that I yield for the stream to drain the data.

let fs = require('fs');
let Stream = require('stream');

(async function () {

  // This is my target number of data
  const targetDataNum = 150000000;

  // Create readable stream
  const readableStream = new Stream.Readable({
    read() { }
  });

  // Create writable stream
  const writableStream = fs.createWriteStream('./test.csv');

  // Write columns first
  writableStream.write('id, body, date, dp\n', 'utf8');

  // Pipe readable stream to writeable stream
  readableStream.pipe(writableStream);

  // Then, push a number of data to the readable stream (150M in this case)
  for (var i = 1; i <= targetDataNum; i += 1) {
    console.log(`Pushing ${i}`);
    const id = i;
    const body = `body${i}`;
    const date = `date${i}`;
    const dp = `dp${i}`;
    const data = `${id},${body},${date},${dp}\n`;
    readableStream.push(data, 'utf8');

    await new Promise(resolve => setImmediate(resolve));

  }

  // End the stream
  readableStream.push(null);

})();

This is what top looks like pretty much the whole time I am running this:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
15213 binaek    **  **  ******  *****  ***** * ***.*  0.5   *:**.** node   

Notice the %MEM, which stays more or less static.

You were running out of memory because you were pre-generating all the data in memory before you wrote any of it to disk. Instead, you need a strategy to write it as you generate it, so you don't have to hold large amounts of data in memory.

It does not seem like you need .pipe() here because you control the generation of the data (it's not coming from some random readStream).

So, you can just generate the data and immediately write it, handling the drain event when needed. Here's a runnable example (this creates a very large file):

const {once} = require('events');
const fs = require('fs');

// This is my target number of data
const targetDataNum = 150000000;

async function run() {

    // Create writable stream
    const writableStream = fs.createWriteStream('./test.csv');

    // Write columns first
    writableStream.write('id, body, date, dp\n', 'utf8');

    // Then, write the rows directly to the writable stream (150M in this case)
    for (let i = 1; i <= targetDataNum; i += 1) {
      const id = i;
      const body = lorem.paragraph(1);
      const date = randomDate(new Date(2014, 0, 1), new Date());
      const dp = randomNumber(1, 1000);
      const data = `${id},${body},${date},${dp}\n`;
      const canWriteMore = writableStream.write(data);
      if (!canWriteMore) {
          // wait for stream to be ready for more writing
          await once(writableStream, "drain");       
      }
    }
    writableStream.end();
}

run().then(() => {
    console.log("done");
}).catch(err => {
    console.log("got rejection: ", err);
});

// placeholders for the functions that were being used
function randomDate(low, high) {
    let rand = randomNumber(low.getTime(), high.getTime());
    return new Date(rand);
}

function randomNumber(low, high) {
    return Math.floor(Math.random() * (high - low)) + low;
}

const lorem = {
    paragraph: function() {
        return "random paragraph";
    }
}
