
NodeJS: Detect last new line byte from Readable Stream when reading by Range to parse large CSV file

Description

I have a very large CSV file (around 1 GB) that I want to process in byte chunks of roughly 10 MB each. For this purpose, I am creating a Readable Stream with the byte-range option: fs.createReadStream(sampleCSVfile, { start: 0, end: 10000000 })

Problem

Using the above approach, the stream read from the CSV file ends with a last line that is incomplete. I want a way to identify the byte index at which the last line break occurred and start my next Readable Stream from that byte index.

Example CSV: (ignore header row)

John,New York,52
Stacy,Chicago,19
Lisa,Indianapolis,40

Sample Operation:

fs.createReadStream(sampleCSVfile, { start: 0, end: 99 })

Data Returned: (trimmed to above-specified byte-range)

John,New York,52
Stacy,Chicago,19
Lisa,I

Required or Expected:

John,New York,52
Stacy,Chicago,19

So, suppose the last new line in the fetched stream ended at byte index 78; my next recursive operation would then be: fs.createReadStream(sampleCSVfile, { start: 79, end: 178 })

Below is some basic code:

const fs = require('fs');

let stream = fs.createReadStream('test.csv', { start: 0, end: 40 });

stream.on('data', (data) => {
  console.log(data.length);              // number of bytes in this chunk
  let a = data.toString();
  console.log(a);
  let i = a.lastIndexOf('\n');           // index of the last line break
  console.log(i);
  let substr = a.substring(0, i);        // drop the trailing partial line
  console.log(substr);
  let byteLength = Buffer.byteLength(substr); // byte offset to resume from
  console.log(byteLength);
});

DEMO : https://repl.it/@sandeepp2016/SpiritedRowdyObject

But there are already CSV parsers like fast-csv, or you can use the readline module, which will let you read a stream of data line by line more efficiently.
