Parallel streaming of a large line-delimited JSON file in Node.js

Question

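I am using createReadStream to read a file containing 350M lines, transforming each line, and writing it back out as a line-delimited file. Below is the code I use to do this.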

var fs = require("fs");
var args = process.argv.slice(2);
var split = require("split");

fs.createReadStream(args[0])
    .pipe(split(JSON.parse)) // emits one parsed object per line
    .on('data', function(obj) {
        // <data transformation operation>
    })
    .on('error', function(err) {
        // errors are currently ignored
    });

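Reading 350M lines takes 40 minutes, and it only uses one CPU core. I have 16 CPU cores. How can I make this line-reading process run in parallel, so that it uses at least 10 cores and finishes the whole operation in less time?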

I tried using this module: https://www.npmjs.com/package/parallel-transform. But when I check htop, it is still a single CPU doing the work.

var transform = require("parallel-transform");

var stream = transform(10, {
    objectMode: true
}, function(data, callback) {
    // <data transformation operation>
    callback(null, data);
});

fs.createReadStream(args[0])
    .pipe(stream)
    .pipe(process.stdout);

What is a better way to read the file in parallel while streaming?

Answer 1

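You could try scramjet. I would be happy to find someone with a solid multi-threading use case, so that proper tests can be set up around this.

Your code would look like this: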

var fs = require("fs");
var os = require("os"); // needed for os.cpus() below
var {StringStream} = require("scramjet");
var args = process.argv.slice(2);

let i = 0;
let threads = os.cpus().length; // you may want to check this out

StringStream.from(fs.createReadStream(args[0]))
    .lines() // it's better to deserialize this in the threads
    .separate(() => i = ++i % threads)
    .cluster(stream => stream // these will happen in the thread
        .JSONParse()
        .map(yourProcessingFunc) // this can be async as well
    )
    .mux() // if the function above returns something you'll get
           // a stream of results
    .run() // this executes the whole workflow.
    .catch(errorHandler)
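
The callback passed to separate above is a plain round-robin counter. If lines that belong to the same entity should always reach the same worker, the index could be derived from the line itself instead. The following is only a minimal sketch: `affinityFor` is a hypothetical helper, not part of scramjet, and it assumes each JSON line carries an `id` field (falling back to hashing the whole line when it does not).

```javascript
// Hypothetical helper (not a scramjet API): maps a raw, still unparsed line
// to a worker index in the range [0, threads).
// Assumption: the JSON line contains an "id" field; otherwise the whole line
// is used as the key.
function affinityFor(line, threads) {
    const match = /"id"\s*:\s*"?([^",}\s]+)/.exec(line);
    const key = match ? match[1] : line;
    let hash = 0;
    for (let i = 0; i < key.length; i++) {
        // 31-based rolling hash, kept in the unsigned 32-bit range
        hash = (Math.imul(hash, 31) + key.charCodeAt(i)) >>> 0;
    }
    return hash % threads;
}

// Assumed usage in the pipeline above, replacing the counter:
//     .separate(line => affinityFor(line, threads))
```

Because the key is extracted with a regular expression, the line is not parsed twice; the full JSON parsing still happens inside the worker.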

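You can pass a better affinity function to separate (see the documentation), which lets you direct data to a specific worker based on the data itself. If you run into any problems, please create a repo and let's see how to fix them.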
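
For comparison, the same fan-out idea (hand lines to several workers round-robin, parse and transform inside the workers) can be sketched with Node's built-in worker_threads module instead of scramjet. This is a different technique from the one in the answer, shown only as an illustration: `processLines` and the `processed` field are hypothetical, the lines are held in memory, and a real 350M-line job would additionally need streaming input and backpressure.

```javascript
const { Worker } = require("worker_threads");

// Source executed inside each worker: parse one line, transform it, and
// send the serialized result back to the main thread.
const workerSource = `
const { parentPort } = require("worker_threads");
parentPort.on("message", (line) => {
    const obj = JSON.parse(line);
    obj.processed = true; // placeholder for the real transformation
    parentPort.postMessage(JSON.stringify(obj));
});
`;

// Hypothetical helper: distributes lines round-robin over `threads` workers
// and resolves with the transformed lines. Output order is not guaranteed.
function processLines(lines, threads) {
    return new Promise((resolve, reject) => {
        if (lines.length === 0) return resolve([]);
        const workers = [];
        const results = [];
        const finish = (err) => {
            workers.forEach((w) => w.terminate());
            if (err) reject(err);
            else resolve(results);
        };
        for (let t = 0; t < threads; t++) {
            const worker = new Worker(workerSource, { eval: true });
            worker.on("message", (out) => {
                results.push(out);
                if (results.length === lines.length) finish();
            });
            worker.on("error", finish); // e.g. a line that is not valid JSON
            workers.push(worker);
        }
        lines.forEach((line, i) => workers[i % threads].postMessage(line));
    });
}
```

Each worker runs on its own thread, so the JSON parsing and the transformation are no longer confined to the main event loop.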

Solution 1
0 2021-02-07 09:45:23