
Read a large file N lines at a time in Node.JS

I have a file with 65,000,000 lines that is about 2 GB in size.

I want to read this file N lines at a time, perform a DB insert operation, and then read the next N, with N being, say, 1000 in this case. Insert order doesn't matter, so synchronous is fine.

What's the best way of doing this? I've only found ways to either load 1 line at a time, or methods that read the whole file into memory. Sample code below, which I've been using to read the file one line at a time:

var LineByLineReader = require('line-by-line'); // assuming the 'line-by-line' npm package

var singleFileParser = (file, insertIntoDB) => {
    var lr = new LineByLineReader(file);
    lr.on('error', function(err) {
        // 'err' contains error object
        console.error(err);
        console.error("Error reading file!");
    });

    lr.on('line', function(line) {
        // 'line' contains the current line without the trailing newline character.
        insertIntoDB(line);
    });

    lr.on('end', function() {
        // All lines are read, file is closed now.
    });
};

Something like this should do:

var cnt = 0;
var tenLines = [];
lr.on('line', function(line) {
    tenLines.push(line);
    if (++cnt >= 10) {
         lr.pause();
         // prepare your SQL statements from tenLines
         dbInsert(<yourSQL>, function(error, returnVal){
            cnt = 0;
            tenLines = [];
            lr.resume();
        });
    }
});
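The dbInsert() call above is left as a placeholder. As a minimal sketch of what it might look like, assuming the `mysql` npm package and a hypothetical `lines` table with a single `text` column, a batched insert could be:

// A minimal sketch, not part of the original answer: a hypothetical
// dbInsert(lines, callback) using the `mysql` npm package's bulk-insert
// form, where `VALUES ?` expands a nested array into multiple rows.
var mysql = require('mysql');
var connection = mysql.createConnection({ /* your connection config */ });

function dbInsert(lines, callback) {
    var rows = lines.map(function(line) { return [line]; });
    connection.query('INSERT INTO lines (text) VALUES ?', [rows], callback);
}

With that shape, the batching snippet would pass the collected array (e.g. dbInsert(tenLines, callback)) rather than a prebuilt SQL string.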

Lines can only be parsed one at a time by something. So, if you want 10 at once, you just collect them one at a time until you have collected 10, and then process the 10.

I did not think Jarek's code quite worked right, so here's a different version that collects 10 lines into an array and then calls dbInsert():

var tenLines = [];
lr.on('line', function(line) {
    tenLines.push(line);
    if (tenLines.length === 10) {
        lr.pause();
        dbInsert(<yourSQL>, function(error, returnVal){
            if (error) {
                // some sort of error handling here
            }
            tenLines = [];
            lr.resume();
        });
    }
});
// process last set of lines in the tenLines buffer (if any)
lr.on('end', function() {
    if (tenLines.length !== 0) {
        // process last set of lines
        dbInsert(...);
    }
});

Jarek's version seems to call dbInsert() on every line event rather than only on every 10th line event, and did not process any leftover lines at the end of the file if the file isn't a perfect multiple of 10 lines long.

This is my solution inside an async function:

let multipleLines = [];
const filepath = '<file>';
const numberLines = 50;

const lineReader = require('readline').createInterface({
    input: require('fs').createReadStream(filepath)
});

// process lines by numberLines
for await (const line of lineReader) {
    multipleLines.push(line);
    if (multipleLines.length === numberLines) {
        await dbInsert(multipleLines);
        multipleLines = [];
    }
}
// process last set of lines (if any)
if (multipleLines.length !== 0) {
    await dbInsert(multipleLines);
}
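Since for await...of needs to run inside an async function (or a module with top-level await), here is a minimal self-contained sketch of the same approach; insertFileInBatches is a hypothetical name, and dbInsert(lines) is assumed to return a Promise:

// A minimal sketch, assuming a dbInsert(lines) that returns a Promise.
const fs = require('fs');
const readline = require('readline');

async function insertFileInBatches(filepath, batchSize) {
    const lineReader = readline.createInterface({
        input: fs.createReadStream(filepath),
        crlfDelay: Infinity // treat \r\n as a single line break
    });

    let batch = [];
    for await (const line of lineReader) {
        batch.push(line);
        if (batch.length === batchSize) {
            await dbInsert(batch); // insert a full batch, then keep reading
            batch = [];
        }
    }
    // flush any leftover lines that didn't fill a whole batch
    if (batch.length !== 0) {
        await dbInsert(batch);
    }
}

insertFileInBatches('<file>', 50).catch(console.error);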
