Handling large CSV uploads in Node.js

Question

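Following up on a previous thread here:

...I'm looking for broader advice on handling large data upload files.

Scenario:

A user uploads a very large CSV file containing hundreds of thousands to millions of rows. It is streamed to an endpoint using multer: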

const storage = multer.memoryStorage();
const upload = multer({ storage: storage });

router.post("/", upload.single("upload"), (req, res) => {
    //...
});

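Each row is converted into a JSON object. That object is then mapped into several smaller objects, which need to be inserted into several different tables that are spread across, and accessed by, various microservice containers.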

async.forEachOfSeries(data, (line, key, callback) => {
    let model = splitData(line);
    //save model.record1, model.record2, etc. sequentially
});

Obviously, I will run into memory limits with this approach. What is the most efficient way to do this?

Answer 1

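To avoid memory issues, you need to process the file using streams; in plain terms, incrementally. Instead of loading the whole file into memory, you read one row at a time, process it, and then let it become eligible for garbage collection right away.

In Node, you can do this by combining a CSV stream parser, which turns the binary contents into a stream of CSV rows, with through2, a stream utility that lets you control the flow of the stream; in this case, pausing it momentarily so that each row can be saved in the database.

The process

The process goes as follows:

You acquire a stream to the data
You pipe it through the CSV parser
You pipe it through a through2
You save each row in your database
When you are done saving, call cb() to move on to the next item.

I'm not familiar with multer, but here is an example that uses a stream from a file.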

const fs = require('fs')
const csv = require('csv-stream')
const through2 = require('through2')

const stream = fs.createReadStream('foo.csv')
  .pipe(csv.createStream({
      endLine : '\n',
      columns : ['Year', 'Make', 'Model'],
      escapeChar : '"',
      enclosedChar : '"'
  }))
  .pipe(through2({ objectMode: true }, (row, enc, cb) => {
    // - `row` holds the first row of the CSV,
    //   as: `{ Year: '1997', Make: 'Ford', Model: 'E350' }`
    // - The stream won't process the *next* item unless you call the callback
    //  `cb` on it.
    // - This allows us to save the row in our database/microservice and when
    //   we're done, we call `cb()` to move on to the *next* row.
    saveIntoDatabase(row).then(() => {
      cb(null, true)
    })
    .catch(err => {
      cb(err, null)
    })
  }))
  .on('data', data => {
    console.log('saved a row')
  })
  .on('end', () => {
    console.log('end')
  })
  .on('error', err => {
    console.error(err)
  })

// Mock function that emulates saving the row into a database,
// asynchronously in ~500 ms
const saveIntoDatabase = row =>
  new Promise((resolve, reject) =>
    setTimeout(() => resolve(), 500))

The example foo.csv looks like this:

1997,Ford,E350
2000,Mercury,Cougar
1998,Ford,Focus
2005,Jaguar,XKR
1991,Yugo,LLS
2006,Mercedes,SLK
2009,Porsche,Boxter
2001,Dodge,Viper

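Why?

This approach avoids having to load the entire CSV into memory. As soon as a row has been processed, it goes out of scope and becomes unreachable, so it is eligible for garbage collection. That is what makes this approach so memory efficient. In theory, it allows you to process files of unlimited size. Read the Stream Handbook for more on streams.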
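To see the one-row-at-a-time behaviour without installing anything, here is a minimal sketch built on Node's own stream.Transform, which is the class through2 wraps. The names createSequentialSaver and stats are hypothetical and exist only for illustration, and save stands in for whatever function writes a row to your database and returns a promise.

```javascript
const { Transform } = require('stream')

// Hypothetical helper (not part of any library): wraps a promise-returning
// `save` function in an object-mode Transform and records how many saves
// are in flight at the same time.
function createSequentialSaver(save) {
  const stats = { inFlight: 0, maxInFlight: 0, saved: 0 }

  const stream = new Transform({
    objectMode: true,
    transform(row, _enc, cb) {
      stats.inFlight++
      stats.maxInFlight = Math.max(stats.maxInFlight, stats.inFlight)

      save(row).then(
        () => {
          stats.inFlight--
          stats.saved++
          // The stream hands over the next row only after `cb` is called.
          cb(null, row)
        },
        err => {
          stats.inFlight--
          cb(err)
        }
      )
    }
  })

  return { stream, stats }
}

// Usage sketch (assumes `rowStream` is an object-mode stream of parsed rows
// and `saveIntoDatabase` returns a promise):
//
//   const { stream, stats } = createSequentialSaver(saveIntoDatabase)
//   rowStream.pipe(stream)
//     .on('data', () => {})
//     .on('end', () => console.log(stats))
```

Because the next row is not delivered until cb is called, stats.maxInFlight should stay at 1 however large the input is, which is the property that keeps memory use flat.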

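Some tips

You may want to save/process more than one row per cycle (in equal-sized chunks). In that case, push some rows into an Array, process/save the entire Array (the chunk), and then call cb to move on to the next chunk, repeating the process.
Streams emit events that you can listen to. The end / error events are particularly useful for responding to whether the operation succeeded or failed.
Express works with streams by default; I'm almost certain you don't need multer at all.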
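The first tip can be sketched as a small batching stream placed between the CSV parser and the step that saves. This is only one possible shape, written with Node's built-in stream.Transform; createBatcher is a hypothetical name, and saveManyIntoDatabase in the usage comment is an assumed function that saves an array of rows and returns a promise.

```javascript
const { Transform } = require('stream')

// Hypothetical helper: collects incoming rows and emits them downstream
// as arrays of `size` rows. The last array may be shorter.
function createBatcher(size) {
  let batch = []

  return new Transform({
    objectMode: true,
    transform(row, _enc, cb) {
      batch.push(row)
      if (batch.length < size) {
        // Not enough rows yet: ask for the next one without emitting.
        return cb()
      }
      const full = batch
      batch = []
      cb(null, full)
    },
    flush(cb) {
      // Emit whatever is left once the input has ended.
      if (batch.length > 0) this.push(batch)
      cb()
    }
  })
}

// Usage sketch (assumes the csv-stream / through2 pipeline shown earlier):
//
//   fs.createReadStream('foo.csv')
//     .pipe(csv.createStream({ columns: ['Year', 'Make', 'Model'] }))
//     .pipe(createBatcher(1000))
//     .pipe(through2({ objectMode: true }, (rows, enc, cb) => {
//       saveManyIntoDatabase(rows).then(() => cb(), cb)
//     }))
```

Only one batch is held in memory at a time, so the batch size trades memory for fewer round trips to the database.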

Answer 2

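Large .csv data parsing and import

I used the model above to import a 1.7 million x 200 csv data matrix into mongo, with the code below. Admittedly it is slow, and I could use some help learning how to chunk the data better to make it more efficient, i.e. instead of inserting after every read, accumulate rows into an array of 5, 10, or 25k rows and then insertMany, or better yet become proficient with the through2-map or through2-filter methods. If anyone is willing to share an example, thanks in advance.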

require('dotenv').config();
const parse = require('csv-parser');
const fs = require("fs");
const through2 = require('through2')
const db = require('../models');

const file = "myFile.csv"

//========Constructor Function for Mongo Import after each read======//
function Hotspot(variable1, variable2,...) {
this.variable1 = variable1;
this.variable2 = variable2;
...}

//========Counter so I can monitor progress in console============//
let counter = 0;

//This function is imported & run in server.js from './scripts' after mongoose connection established//

exports.importCsvData = () => {
    fs.createReadStream(file)
        .pipe(parse())
        .pipe(through2({ objectMode: true }, (row, enc, cb) => {
            let hotspot = new Hotspot(
                `${row["ROW_VARIABLE_COLUMN_1"]}`,
                `${row["ROW_VARIABLE_COLUMN_2"]}`,...)

            // One insert per row; `cb` is called only after the insert
            // settles, so the next row is not read until then.
            db.MongoModel.create(hotspot)
                .then(result => console.log('created', counter++))
                .then(() => { cb(null, true) })
                .catch(err => {
                    cb(err, null)
                })
        }))
        .on('data', () => {
            // A 'data' listener keeps the stream flowing. Nothing is stored
            // here, since accumulating every row would defeat the streaming.
        })
        .on('end', () => {
            console.log('read complete')
        })
        .on('error', err => {
            console.error(err)
        })
}

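I used the following posts and links:

as a basis and reference for writing this script. It seems to work "fine", except that I started it at 10 pm last night and it was less than halfway done by 7:45 this morning. That is still better than the "event": "Allocation failed - JavaScript heap out of memory" error I received after trying to accumulate all the "hotspot" objects into a hotspots array for a bulk insert into mongoDB. I'm fairly new to readStream/through2/csv-parser in Node and still learning, but I wanted to share something that works and is currently running.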
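As one possible answer to the request for a chunking example, here is a hedged sketch that reads rows with async iteration and writes them in fixed-size batches. importInBatches and insertBatch are hypothetical names; in the script above, insertBatch could be a function that calls Mongoose's insertMany on the model, but that wiring is an assumption and has not been run against the data described here.

```javascript
// Hypothetical helper: pulls rows from any iterable or async iterable
// (Node readable streams are async iterable) and hands them to
// `insertBatch` in arrays of at most `size` rows. Resolves with the number
// of rows written.
async function importInBatches(rows, insertBatch, size = 5000) {
  let batch = []
  let total = 0

  for await (const row of rows) {
    batch.push(row)
    if (batch.length >= size) {
      // The loop does not pull more rows while the insert is pending,
      // so at most one batch (plus the stream's buffer) is held in memory.
      await insertBatch(batch)
      total += batch.length
      batch = []
    }
  }

  // Write the final, possibly shorter, batch.
  if (batch.length > 0) {
    await insertBatch(batch)
    total += batch.length
  }

  return total
}

// Usage sketch (assumes the `parse`, `file`, and `db` names from the script
// above, and that `db.MongoModel` is a Mongoose model):
//
//   const rows = fs.createReadStream(file).pipe(parse())
//   importInBatches(rows, docs => db.MongoModel.insertMany(docs), 5000)
//     .then(total => console.log('inserted', total))
//     .catch(err => console.error(err))
```

Compared with one create call per row, this makes one database round trip per batch, and unlike accumulating every object first, it never holds more than one batch at a time.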

2 solutions

Solution 1
10 2017-12-27 23:56:07

Solution 2
1 2020-02-14 13:18:09
