
Parsing huge binary files in Node.js

I want to create a Node.js module that can parse huge binary files (some larger than 200GB). Each file is divided into chunks, and each chunk can be larger than 10GB. I tried the flowing and non-flowing methods of reading the file, but the problem is that the end of the buffer is reached while a chunk is still being parsed, so parsing of that chunk must be suspended before the next `data` event occurs. This is what I've tried:

var s = getStream();

s.on('data', function(a){
    parseChunk(a);
});

function parseChunk(a){
    /*
        There is a lot of code and many functions here.
        One chunk is larger than the buffer passed to this function,
        so when the end of this buffer is reached, the parseChunk
        function must be suspended before parsing is finished.
        Also, when the next buffer is passed in, it is not the start of
        a new chunk, because the previous chunk was not parsed to the end.
    */
}

Loading a whole chunk into process memory isn't possible because I have only 8GB of RAM. How can I synchronously read data from the stream, or how can I pause the parseChunk function when the end of the buffer is reached and wait until new data is available?

Maybe I'm missing something, but as far as I can tell, I don't see a reason why this couldn't be implemented using streams with a different syntax. I'd use:

let chunk;
let Nbytes; // # of bytes to read into a chunk
stream.on('readable', () => {
  // Note the parentheses: assign the result of read() to chunk first,
  // then compare it against null. Without them, chunk would be assigned
  // the boolean result of stream.read(Nbytes) !== null.
  while ((chunk = stream.read(Nbytes)) !== null) {
    // call whatever you like on the chunk of data of size Nbytes
  }
});

Note that if you specify the size of the chunk yourself, as done here, `null` will be returned when the requested number of bytes is not yet available. This doesn't mean there is no more data to stream. So just be aware that at the end of the file you should expect back a 'trimmed' buffer object of size < Nbytes.
