
Sorting a data stream before writing to file in nodejs

I have an input file which may contain up to 1M records, and each record looks like this:

field1 field2 field3 \n

I want to read this input file and sort it based on field3 before writing it to another file.

Here is what I have so far:

var fs = require('fs'),
    readline = require('readline'),
    stream = require('stream');

var start = Date.now();

var outstream = new stream;
outstream.readable = true;
outstream.writable = true;

var rl = readline.createInterface({
    input: fs.createReadStream('cross.txt'),
    output: outstream,
    terminal: false
});

rl.on('line', function(line) {
    //var tmp = line.split("\t").reverse().join('\t') + '\n';
    //fs.appendFileSync("op_rev.txt", tmp );
    // this logic to reverse and then sort is too slow
});

rl.on('close', function() {
    var closetime = Date.now();
    console.log('Read entirefile. ', (closetime - start)/1000, ' secs');
});

I am basically stuck at this point. All I have is the ability to read from one file and write to another. Is there a way to efficiently sort this data before writing it?

DB and sort-stream are fine solutions, but a DB might be overkill, and I think sort-stream eventually just sorts the entire file in an in-memory array (in the through end callback), so I think performance will be roughly the same compared to the original solution.
(But I haven't run any benchmarks, so I might be wrong.)

So, just for the heck of it, I'll throw in another solution :)


EDIT: I was curious to see how big a difference this would make, so I ran some benchmarks.

The results were surprising even to me. It turns out the sort -k3,3 solution is better by far, about x10 times faster than the original solution (a simple array sort), while the nedb and sort-stream solutions are at least x18 times slower than the original solution (i.e. at least x180 times slower than sort -k3,3).

(See benchmark results below)


If you are on a *nix machine (Unix, Linux, Mac, ...) you can simply use
sort -k 3,3 yourInputFile > op_rev.txt and let the OS do the sorting for you.
You'll probably get better performance, since sorting is done natively.

Or, if you want to process the sorted output in Node:

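var readline = require('readline'),
    spawn = require('child_process').spawn,
    sort = spawn('sort', ['-k3,3', './test.tsv']);

// Read stdout line by line: a raw 'data' chunk can end in the middle of a
// line, so splitting each chunk on '\n' could break records apart.
var sortedLines = readline.createInterface({ input: sort.stdout });

sortedLines.on('line', function (line) {
    // process one sorted record
    var record = line.split("\t");
    console.info(`Record: ${record}`);
});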

sort.on('exit', function (code) {
    if (code) {
        // handle error
    }

    console.log('Done');
});

// optional
sort.stderr.on('data', function (data) {
    // handle error...
    console.log('stderr: ' + data);
});
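
If you only need the sorted file and don't want to handle the stream yourself, something like the following might be enough. This is just a sketch: sortFileByThirdField and the temporary file names are made up for illustration, it assumes a system sort that supports the standard -t, -k and -o options, and it sets LC_ALL=C so the comparison is plain byte order regardless of locale. Passing -t with a tab makes sort split fields on tabs only, which matters if a field can contain spaces.

```javascript
var fs = require('fs'),
    os = require('os'),
    path = require('path'),
    spawnSync = require('child_process').spawnSync;

// Hypothetical helper: sort a tab-separated file by its third field
// using the system `sort`, writing the result to outputPath.
function sortFileByThirdField(inputPath, outputPath) {
    var result = spawnSync('sort', ['-t', '\t', '-k3,3', '-o', outputPath, inputPath], {
        // LC_ALL=C asks for byte-order comparison, independent of the user's locale.
        env: Object.assign({}, process.env, { LC_ALL: 'C' })
    });
    if (result.error) throw result.error; // e.g. `sort` is not installed
    if (result.status !== 0) {
        throw new Error('sort failed: ' + String(result.stderr));
    }
}

// Usage with a small temporary file (the names below are only for the demo).
var dir = fs.mkdtempSync(path.join(os.tmpdir(), 'sort-demo-'));
var inputFile = path.join(dir, 'input.tsv');
var outputFile = path.join(dir, 'sorted.tsv');
fs.writeFileSync(inputFile, 'a\tb\tpear\nc\td\tapple\ne\tf\tfig\n');
sortFileByThirdField(inputFile, outputFile);
console.log(fs.readFileSync(outputFile, 'utf8'));
```

spawnSync blocks until sort finishes, which is fine for a one-off script; in a server you would probably stick with the asynchronous spawn version above.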

Hope this helps :)


EDIT: Adding some benchmark details.

I was curious to see how big a difference this would make, so I ran some benchmarks.

Here are the results (running on a MacBook Pro):

  • sort1 uses a straightforward approach, sorting the records in an in-memory array.
    Avg time: 35.6s (baseline)

  • sort2 uses sort-stream, as suggested by Joe Krill.
    Avg time: 11.1 min (about x18.7 times slower)
    (I wonder why. I didn't dig in.)

  • sort3 uses nedb, as suggested by Tamas Hegedus.
    Time: about 16 min (about x27 times slower)

  • sort4 only sorts, by executing sort -k 3,3 input.txt > out4.txt in a terminal.
    Avg time: 1.2s (about x30 times faster)

  • sort5 uses sort -k3,3 and processes the response sent to stdout.
    Avg time: 3.65s (about x9.7 times faster)

You can take advantage of streams for something like this. There are a few NPM modules that will be helpful. First install them by running

npm install sort-stream csv-parse stream-transform

from the command line.

Then:

var fs = require('fs');
var sort = require('sort-stream');
var parse = require('csv-parse');
var transform = require('stream-transform');

// Create a readable stream from the input file.
fs.createReadStream('./cross.txt')
  // Use `csv-parse` to parse the input using a tab character (\t) as the 
  // delimiter. This produces a record for each row which is an array of 
  // field values.
  .pipe(parse({
    delimiter: '\t'
  }))
  // Use `sort-stream` to sort the parsed records on the third field. 
  .pipe(sort(function (a, b) {
    return a[2].localeCompare(b[2]);
  }))
  // Use `stream-transform` to transform each record (an array of fields) into 
  // a single tab-delimited string to be output to our destination text file.
  .pipe(transform(function(row) {
    return row.join('\t') + '\n';
  }))
  // And finally, output those strings to our destination file.
  .pipe(fs.createWriteStream('./cross_sorted.txt'));

I had quite a similar issue and needed to perform an external sort.

After wasting some time on it, I figured out that I could load the data into a database and then query the desired data back out of it.

It doesn't even matter if the inserts aren't ordered, as long as the query result can be.

Hope it can work for you too.

To insert your data into a database, there are plenty of tools in Node to perform such a task. I have this pet project which does a similar job.

I'm also sure that if you search the subject, you'll find much more info.

Good luck.
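
For anyone wondering what an external sort does in the first place, here is a rough sketch of the idea: sort the data in chunks that fit in memory, then merge the sorted chunks. It is not what I ended up using, and all the function names are made up. To keep it short the chunks stay in arrays; a real external sort would write each sorted chunk to a temp file and stream the chunks back during the merge.

```javascript
// Compare two tab-separated lines by their third field.
function compareByThirdField(a, b) {
    var ka = a.split('\t')[2], kb = b.split('\t')[2];
    if (ka < kb) return -1;
    if (ka > kb) return 1;
    return 0;
}

// Phase 1: cut the input into chunks of chunkSize lines and sort each chunk.
function sortedChunks(lines, chunkSize) {
    var chunks = [];
    for (var i = 0; i < lines.length; i += chunkSize) {
        chunks.push(lines.slice(i, i + chunkSize).sort(compareByThirdField));
    }
    return chunks;
}

// Phase 2: k-way merge. Repeatedly take the smallest head among all chunks.
function mergeChunks(chunks) {
    var positions = chunks.map(function () { return 0; });
    var merged = [];
    while (true) {
        var best = -1;
        for (var c = 0; c < chunks.length; c++) {
            if (positions[c] >= chunks[c].length) continue; // chunk exhausted
            if (best === -1 ||
                compareByThirdField(chunks[c][positions[c]],
                                    chunks[best][positions[best]]) < 0) {
                best = c;
            }
        }
        if (best === -1) break; // every chunk is exhausted
        merged.push(chunks[best][positions[best]]);
        positions[best] += 1;
    }
    return merged;
}

var input = ['1\tx\tdelta', '2\tx\talpha', '3\tx\tcharlie', '4\tx\tbravo', '5\tx\techo'];
var result = mergeChunks(sortedChunks(input, 2));
console.log(result.join('\n'));
```

The linear scan over chunk heads is fine for a handful of chunks; with many chunks a heap would be the usual choice.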

You have two options, depending on how much data is being processed. (A count of 1M records with 3 columns doesn't say much about the amount of actual data.)

Load the data in memory, sort in place

var lines = [];
rl.on('line', function(line) {
    lines.push(line.split("\t").reverse());
});

rl.on('close', function() {
    lines.sort(function(a, b) { return compare(a[0], b[0]); });

    // write however you want
    fs.writeFileSync(
        fileName,
        lines.map(function(x) { return x.join("\t"); }).join("\n")
    );
    function compare(a, b) {
        if (a < b) return -1;
        if (a > b) return 1;
        return 0;
    }
});
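
Note that the code above writes the fields in reversed order. If you would rather keep the original field order, a variant along these lines should work. It is only a sketch: sortLinesByThirdField is a made-up name, it assumes every line has at least three tab-separated fields, and it compares keys as plain strings.

```javascript
// Hypothetical helper: sort tab-separated lines by their third field,
// leaving each line exactly as it was read.
function sortLinesByThirdField(lines) {
    return lines
        // Extract the sort key once per line instead of once per comparison.
        .map(function (line) {
            return { key: line.split('\t')[2], line: line };
        })
        .sort(function (a, b) {
            if (a.key < b.key) return -1;
            if (a.key > b.key) return 1;
            return 0;
        })
        .map(function (entry) { return entry.line; });
}

var sorted = sortLinesByThirdField(['a\tb\tpear', 'c\td\tapple', 'e\tf\tfig']);
console.log(sorted.join('\n'));
```

In the 'close' handler you would collect the raw lines with lines.push(line) and then write sortLinesByThirdField(lines).join("\n") to the output file.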

Load the data in a persistent database, read ordered

Use a database engine of your choice (for example nedb, a pure JavaScript DB for nodejs).

EDIT: It seems that NeDB keeps the whole database in memory; the file is only a persistent copy of the data. We'll have to search for another implementation. TingoDB looks promising.

// This code is only to give an idea, not tested in any way

var Datastore = require('nedb');
var db = new Datastore({
    filename: 'path/to/temp/datafile',
    autoload: true
});

rl.on('line', function(line) {
    var tmp = line.split("\t").reverse();
    db.insert({
        field0: tmp[0],
        field1: tmp[1],
        field2: tmp[2]
    });
});

rl.on('close', function() {
    var cursor = db.find({})
            .sort({ field0: 1 }); // sort by field0, ascending
    var PAGE_SIZE = 1000;
    paginate(0);
    function paginate(i) {
        cursor.skip(i).limit(PAGE_SIZE).exec(function(err, docs) {
            // handle errors

            var tmp = docs.map(function(o) {
                return o.field0 + "\t" + o.field1 + "\t" + o.field2 + "\n";
            });
            fs.appendFileSync("op_rev.txt", tmp.join(""));
            if (docs.length >= PAGE_SIZE) {
                paginate(i + PAGE_SIZE);
            } else {
                // cleanup temp database
            }
        });
    }
});
