Parsing a large JSON file in Nodejs

Question

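I have a file that stores many JavaScript objects in JSON form, and I need to read the file, create each of the objects, and do something with them (insert them into a database, in my case). The JavaScript objects can be represented in one of the following formats:

Format A: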

[{name: 'thing1'},
....
{name: 'thing999999999'}]

or Format B:

{name: 'thing1'}         // <== My choice.
...
{name: 'thing999999999'}

Note that the ... stands for a lot of JSON objects. I know I could read the entire file into memory and then use JSON.parse() like this:

fs.readFile(filePath, 'utf-8', function (err, fileContents) {
  if (err) throw err;
  console.log(JSON.parse(fileContents));
});

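However, the file could be really large, so I would prefer to use a stream for this. The problem I see with a stream is that the file contents could be broken into data chunks at any point, so how can I use JSON.parse() on such objects?

Ideally, each object would be read as a separate data chunk, but I am not sure how to do that.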

var importStream = fs.createReadStream(filePath, {flags: 'r', encoding: 'utf-8'});
importStream.on('data', function(chunk) {

    var pleaseBeAJSObject = JSON.parse(chunk);           
    // insert pleaseBeAJSObject in a database
});
importStream.on('end', function() {
   console.log("Woot, imported objects into the database!");
});

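Note that I want to avoid reading the entire file into memory. Time efficiency does not matter to me. Yes, I could try to read several objects at once and insert them all together, but that is a performance tweak - I need a way that is guaranteed not to cause a memory overload, no matter how many objects the file contains.

I can choose to use Format A, Format B, or something else; please specify in your answer. Thanks!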

Answer 1

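To process a file line by line, you simply need to decouple the reading of the file from the code that acts on that input. You can do this by buffering the input until you hit a newline. Assuming we have one JSON object per line (basically, Format B):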

var fs = require('fs');

var stream = fs.createReadStream(filePath, {flags: 'r', encoding: 'utf-8'});
var buf = '';

stream.on('data', function(d) {
    buf += d.toString(); // when data is read, stash it in a string buffer
    pump(); // then process the buffer
});

function pump() {
    var pos;

    while ((pos = buf.indexOf('\n')) >= 0) { // keep going while there's a newline somewhere in the buffer
        if (pos == 0) { // if there's more than one newline in a row, the buffer will now start with a newline
            buf = buf.slice(1); // discard it
            continue; // so that the next iteration will start with data
        }
        processLine(buf.slice(0,pos)); // hand off the line
        buf = buf.slice(pos+1); // and slice the processed data off the buffer
    }
}

function processLine(line) { // here's where we do something with a line

    if (line[line.length-1] == '\r') line=line.substr(0,line.length-1); // discard CR (0x0D)

    if (line.length > 0) { // ignore empty lines
        var obj = JSON.parse(line); // parse the JSON
        console.log(obj); // do something with the data here!
    }
}

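Each time the file stream receives data from the file system, it is stashed in a buffer, and then pump is called.

If there is no newline in the buffer, pump simply returns without doing anything. More data (and potentially a newline) will be added to the buffer the next time the stream gets data, and then we will have a complete object.

If there is a newline, pump slices the buffer from the beginning up to the newline and hands it off to processLine. It then checks again whether there is another newline in the buffer (the while loop). In this way, we can process all of the lines that were read in the current chunk.

Finally, processLine is called once per input line. If present, it strips off the carriage return character (to avoid issues with line endings - LF vs CRLF), and then calls JSON.parse on the line. At this point, you can do whatever you need to with your object.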
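One gap in the buffering approach is a final line that does not end with a newline: it stays in the buffer unless something flushes it when the stream ends. The following is a minimal sketch of the same idea with an explicit flush step. The names createLineBuffer, push and flush are hypothetical and are not part of the answer's code.

```javascript
// Hypothetical helper: collects chunks and hands complete lines to onLine.
function createLineBuffer(onLine) {
  let buf = '';

  function emit(line) {
    // Strip a trailing CR so CRLF files behave like LF files.
    if (line.endsWith('\r')) line = line.slice(0, -1);
    if (line.length > 0) onLine(line); // ignore empty lines
  }

  return {
    // Call this from the stream's 'data' handler.
    push(chunk) {
      buf += chunk;
      let pos;
      while ((pos = buf.indexOf('\n')) >= 0) {
        emit(buf.slice(0, pos));
        buf = buf.slice(pos + 1);
      }
    },
    // Call this from the stream's 'end' handler: the last line may lack a newline.
    flush() {
      emit(buf);
      buf = '';
    },
  };
}

// Chunks that split an object in the middle are reassembled before parsing.
const objects = [];
const lines = createLineBuffer((line) => objects.push(JSON.parse(line)));
lines.push('{"name":"thing1"}\r\n{"na');
lines.push('me":"thing2"}\n{"name":"thing3"}');
lines.flush();
console.log(objects.length); // 3
```

With a real file, push would be wired to stream.on('data', ...) and flush to stream.on('end', ...).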

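Note that JSON.parse is strict about what it accepts as input; you must quote your property names and string values with double quotes. In other words, {name:'thing1'} will throw an error; you must use {"name":"thing1"}.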
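Since a single malformed line would otherwise throw and abort the whole import, it can help to wrap the parse step. This is a small sketch, not part of the answer; parseLine is a hypothetical name, and what to do with a bad line (skip, log, stop) is left to the caller.

```javascript
// Hypothetical wrapper: reports a parse failure instead of throwing.
function parseLine(line) {
  try {
    return { ok: true, value: JSON.parse(line) };
  } catch (err) {
    return { ok: false, error: err };
  }
}

console.log(parseLine("{name:'thing1'}").ok);   // false
console.log(parseLine('{"name":"thing1"}').ok); // true
```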

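Because no more than one chunk of data is ever in memory at a time, this is extremely memory efficient. It is also extremely fast: a quick test showed that I processed 10,000 rows in under 15 ms.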

Answer 2

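Just as I was thinking that it would be fun to write a streaming JSON parser, I also thought that maybe I should do a quick search to see whether one is already available.

Turns out there is.

JSONStream: "streaming JSON.parse and stringify"

Since I just found it, I have obviously not used it, so I can't comment on its quality, but I would be interested to hear whether it works.

Consider the following JavaScript, which uses _.isString: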

stream.pipe(JSONStream.parse('*'))
  .on('data', (d) => {
    console.log(typeof d);
    console.log("isString: " + _.isString(d))
  });

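If the stream is an array of objects, this will log the objects as they come in. Therefore, the only thing being buffered is one object at a time.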
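To illustrate why only one object needs to be buffered at a time, here is a hand-rolled sketch that tracks bracket depth while characters arrive. It is not JSONStream's implementation, createArraySplitter is a hypothetical name, and it only handles a top-level array whose elements are objects or arrays (top-level strings and numbers are ignored). For real use, a tested library is the safer choice.

```javascript
// Hypothetical helper: emits each element of a top-level JSON array
// as soon as its closing bracket has been seen.
function createArraySplitter(onElement) {
  let depth = 0;        // how many { or [ are currently open
  let inString = false; // true while inside a "..." string
  let escaped = false;  // true right after a backslash inside a string
  let current = '';     // text of the element being collected

  return function push(chunk) {
    for (const ch of chunk) {
      if (depth >= 2) current += ch; // inside an element: keep the text

      if (inString) {
        if (escaped) escaped = false;
        else if (ch === '\\') escaped = true;
        else if (ch === '"') inString = false;
        continue; // brackets inside strings do not count
      }

      if (ch === '"') {
        inString = true;
      } else if (ch === '{' || ch === '[') {
        depth += 1;
        if (depth === 2) current = ch; // first character of a new element
      } else if (ch === '}' || ch === ']') {
        depth -= 1;
        if (depth === 1) {
          onElement(JSON.parse(current)); // the element is complete
          current = '';
        }
      }
    }
  };
}

// Chunk boundaries may fall anywhere, even inside a string.
const seen = [];
const push = createArraySplitter((obj) => seen.push(obj));
push('[{"name":"thing1"},{"name":"a } b",');
push('"tags":["x","y"]},\n{"name":"thi');
push('ng3"}]');
console.log(seen.length); // 3
```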

Answer 3

As of October 2014, you can do something like the following (using JSONStream) - https://www.npmjs.org/package/JSONStream

var fs = require('fs'),
    JSONStream = require('JSONStream');

var getStream = function () {
    var jsonData = 'myData.json',
        stream = fs.createReadStream(jsonData, { encoding: 'utf8' }),
        parser = JSONStream.parse('*');
    return stream.pipe(parser);
}

getStream().pipe(MyTransformToDoWhateverProcessingAsNeeded).on('error', function (err) {
    // handle any errors
});

To demonstrate with a working example:

npm install JSONStream event-stream

data.json:

{
  "greeting": "hello world"
}

hello.js:

var fs = require('fs'),
    JSONStream = require('JSONStream'),
    es = require('event-stream');

var getStream = function () {
    var jsonData = 'data.json',
        stream = fs.createReadStream(jsonData, { encoding: 'utf8' }),
        parser = JSONStream.parse('*');
    return stream.pipe(parser);
};

getStream()
    .pipe(es.mapSync(function (data) {
        console.log(data);
    }));

$ node hello.js
// hello world

Answer 4

I had a similar requirement: I needed to read a large JSON file in Node.js, process the data in chunks, call an API, and save the results in MongoDB. inputFile.json looks like this:

{
 "customers":[
       { /*customer data*/},
       { /*customer data*/},
       { /*customer data*/}....
      ]
}

Now I used JSONStream and event-stream to achieve this synchronously.

var fs = require("fs");
var JSONStream = require("JSONStream");
var es = require("event-stream");

var fileStream = fs.createReadStream(filePath, { encoding: "utf8" });
fileStream.pipe(JSONStream.parse("customers.*")).pipe(
  es.through(
    function write(data) {
      console.log("printing one customer object read from file ::");
      console.log(data);
      this.pause(); // stop reading until this customer has been saved
      processOneCustomer(data, this);
      return data;
    },
    function end() {
      console.log("stream reading ended");
      this.emit("end");
    }
  )
);

function processOneCustomer(data, es) {
  DataModel.save(function(err, dataModel) {
    es.resume();
  });
}

Answer 5

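I realize that you want to avoid reading the whole JSON file into memory if possible; however, if you have the memory available, it may not be a bad idea performance-wise. Using node.js's require() on a JSON file loads the data into memory really fast.

I ran two tests to see what the performance looked like when printing out an attribute from each feature in an 81MB geojson file.

In the first test, I read the entire geojson file into memory using var data = require('./geo.json'). That took 3330 milliseconds, and then printing out an attribute from each feature took 804 milliseconds, for a grand total of 4134 milliseconds. However, it appeared that node.js was using 411MB of memory.

In the second test, I used @arcseldon's answer with JSONStream + event-stream. I modified the JSONPath query to select only what I needed. This time the memory never went higher than 82MB; however, the whole thing now took 70 seconds to complete!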
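For anyone who wants to repeat this kind of comparison on their own data, a rough measurement helper could look like the sketch below. It is not the code used for the numbers above; measure is a hypothetical name, it only times synchronous work, and the resident set size it reports is a coarse indicator rather than a precise memory profile.

```javascript
// Hypothetical helper: times a synchronous function and reports process memory.
function measure(label, fn) {
  const start = process.hrtime.bigint();
  const result = fn();
  const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;
  const rssMb = process.memoryUsage().rss / (1024 * 1024);
  console.log(`${label}: ${elapsedMs.toFixed(1)} ms, rss ${rssMb.toFixed(1)} MB`);
  return { result, elapsedMs, rssMb };
}

// A tiny in-memory document stands in for the 81MB file here.
const text = JSON.stringify({ features: [{ properties: { id: 1 } }] });
const run = measure('JSON.parse', () => JSON.parse(text));
```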

Answer 6

I wrote a module that can do this, called BFJ. Specifically, the method bfj.match can be used to break up a large stream into discrete chunks of JSON:

const bfj = require('bfj');
const fs = require('fs');

const stream = fs.createReadStream(filePath);

bfj.match(stream, (key, value, depth) => depth === 0, { ndjson: true })
  .on('data', object => {
    // do whatever you need to do with object
  })
  .on('dataError', error => {
    // a syntax error was found in the JSON
  })
  .on('error', error => {
    // some kind of operational error occurred
  })
  .on('end', error => {
    // finished processing the stream
  });

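Here, bfj.match returns a readable, object-mode stream that will receive the parsed data items, and is passed 3 arguments:

A readable stream containing the input JSON.
A predicate that indicates which items from the parsed JSON will be pushed to the result stream.
An options object indicating that the input is newline-delimited JSON (this is to process Format B from the question; it is not required for Format A).

Upon being called, bfj.match will parse JSON from the input stream depth-first, calling the predicate with each value to determine whether or not to push that item to the result stream. The predicate is passed three arguments:

The property key or array index (this will be undefined for top-level items).
The value itself.
The depth of the item in the JSON structure (zero for top-level items).

Of course, a more complex predicate can also be used as necessary. You can also pass a string or a regular expression instead of a predicate function if you want to perform simple matches against property keys.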
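As an illustration of the predicate signature, the sketch below defines two predicates as plain functions so they can be tried without a stream. The names are hypothetical. The second one rests on an assumption drawn from the description above: if the enclosing array is the top-level item at depth 0, its elements would arrive at depth 1. That should be checked against BFJ's documentation before relying on it.

```javascript
// Format B (newline-delimited): every top-level value is a record.
const isTopLevel = (key, value, depth) => depth === 0;

// Format A (one big array): ASSUMPTION - the array is at depth 0,
// so each record inside it is an object at depth 1.
const isArrayElement = (key, value, depth) =>
  depth === 1 && value !== null && typeof value === 'object';

// Calling the predicates by hand, the way the parser would:
console.log(isTopLevel(undefined, { name: 'thing1' }, 0)); // true
console.log(isArrayElement(0, { name: 'thing1' }, 1));     // true
console.log(isArrayElement('name', 'thing1', 2));          // false
```

Either function would be passed as the second argument to bfj.match in place of the inline arrow function.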

Answer 7

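If you have control over the input file, and it is an array of objects, you can solve this more easily. Arrange for the file to be output with each record on one line, like this: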

[
   {"key": value},
   {"key": value},
   ...

This is still valid JSON.

Then, use the node.js readline module to process them one line at a time.

var fs = require("fs");

var lineReader = require('readline').createInterface({
    input: fs.createReadStream("input.txt")
});

lineReader.on('line', function (line) {
    line = line.trim();

    if (line.charAt(line.length-1) === ',') {
        line = line.substr(0, line.length-1);
    }

    if (line.charAt(0) === '{') {
        processRecord(JSON.parse(line));
    }
});

function processRecord(record) {
    // Process the records one at a time here! 
}
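If each record has to be written to a database before the next one is handled, the same readline approach can be driven with for await instead of the 'line' event. This is a sketch under that assumption; parseArrayLine, processFile and handleRecord are hypothetical names, and handleRecord stands for whatever asynchronous work is needed.

```javascript
const fs = require('fs');
const readline = require('readline');

// Turns one line of the "one record per line" layout into an object.
// Returns undefined for lines such as "[" or "]" that hold no record.
function parseArrayLine(line) {
  let text = line.trim();
  if (text.endsWith(',')) text = text.slice(0, -1); // drop the trailing comma
  return text.startsWith('{') ? JSON.parse(text) : undefined;
}

// Hypothetical driver: waits for handleRecord to finish
// before handling the next line, so records are processed in order.
async function processFile(filePath, handleRecord) {
  const rl = readline.createInterface({
    input: fs.createReadStream(filePath, { encoding: 'utf8' }),
    crlfDelay: Infinity, // treat \r\n as a single line break
  });
  for await (const line of rl) {
    const record = parseArrayLine(line);
    if (record !== undefined) await handleRecord(record);
  }
}
```

A call such as processFile('input.txt', saveToDatabase) would then resolve once every record has been handled, where saveToDatabase is a placeholder for your own async function.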

Answer 8

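I solved this problem using the split npm module. Pipe your stream into split, and it will "break up a stream and reassemble it so that each line is a chunk".

Sample code: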

var fs = require('fs')
  , split = require('split')
  ;

var stream = fs.createReadStream(filePath, {flags: 'r', encoding: 'utf-8'});
var lineStream = stream.pipe(split());
lineStream.on('data', function(chunk) {
    if (chunk.length === 0) return; // skip empty lines, e.g. after a trailing newline
    var json = JSON.parse(chunk);
    // ...
});

Answer 9


var https = require('https');

https.get(url1, function(response) {
    var data = "";
    response.on('data', function(chunk) {
        data += chunk.toString(); // accumulate the response body
    })
    .on('end', function() {
        console.log(data);
    });
});

Answer 10

Based on @josh3736's answer, but for ES2021 and Node.js 16+, with async/await and the AirBnb rules:

import fs from 'node:fs';

const file = 'file.json';

/**
 * @callback itemProcessorCb
 * @param {object} item The current item
 */

/**
 * Process each data chunk in a stream.
 *
 * @param {import('fs').ReadStream} readable The readable stream
 * @param {itemProcessorCb} itemProcessor A function to process each item
 */
async function processChunk(readable, itemProcessor) {
  let data = '';
  let total = 0;

  // eslint-disable-next-line no-restricted-syntax
  for await (const chunk of readable) {
    // join with last result, remove CR and get lines
    const lines = (data + chunk).replace(/\r/g, '').split('\n');

    // clear last result
    data = '';

    // process lines
    let line = lines.shift();
    const items = [];

    while (line !== undefined) {
      // check if isn't a empty line or an array definition
      if (line !== '' && !/[\[\]]+/.test(line)) {
        try {
          // remove the last comma and parse json
          const json = JSON.parse(line.replace(/\s?(,)+\s?$/, ''));
          items.push(json);
        } catch (error) {
          // last line gets only a partial line from chunk
          // so we add this to join at next loop
          data += line;
        }
      }

      // continue
      line = lines.shift();
    }

    total += items.length;

    // Process items in parallel
    await Promise.all(items.map(itemProcessor));
  }

  console.log(`${total} items processed.`);
}

// Process each item
async function processItem(item) {
  console.log(item);
}

// Init
try {
  const readable = fs.createReadStream(file, {
    flags: 'r',
    encoding: 'utf-8',
  });

  await processChunk(readable, processItem);
} catch (error) {
  console.error(error.message);
}

For JSON such as:

[
  { "name": "A", "active": true },
  { "name": "B", "active": false },
  ...
]

Answer 11

I think you need to use a database. MongoDB is a good choice in this case because it is JSON compatible.

Update: you can use the mongoimport tool to import JSON data into MongoDB.

mongoimport --collection collection --file collection.json

11 answers

Answer 1: score 91, accepted, 2012-08-08 23:26:11
Answer 2: score 42
Answer 3: score 36, 2014-07-12 05:51:21
Answer 4: score 23, 2016-01-27 17:16:19
Answer 5: score 22, 2016-04-13 07:06:54
Answer 6: score 9, 2018-04-02 17:20:16
Answer 7: score 5, 2016-06-02 15:46:52
Answer 8: score 4, 2015-05-24 05:16:58
Answer 9: score 0, 2022-08-15 04:48:31
Answer 10: score 0, 2022-09-13 18:52:56
Answer 11: score -5, 2012-08-08 22:34:51
