PapaParse and Highland

I have to parse a very big CSV file in Node.js and save it to a database (an async operation) that accepts up to 500 entries at a time. Due to memory limits I have to stream the CSV file, and I want to use PapaParse to parse it (as that worked best in my case).

As PapaParse uses a callback-style approach to parse Node.js streams, I didn't see an easy way to combine highland (for batching and data transforms) and PapaParse. So I tried to use a PassThrough stream to write data to, and to read that stream with highland for batching:

const csv = require('papaparse');
const fs = require('fs');
const highland = require('highland');
const { PassThrough } = require('stream');

const passThroughStream = new PassThrough({ objectMode: true });
const fileStream = fs.createReadStream('file.csv'); // the big CSV file (path is illustrative)

csv.parse(fileStream, {
  step: function(row) {
    // Write each parsed row to the stream
    passThroughStream.write(row.data[0]);
  },
  complete: function() {
    // End the stream (writing null would throw: null is reserved as the end-of-stream marker)
    passThroughStream.end();
  },
});

highland(passThroughStream)
  .map((data) => {
    // data transform
  })
  .batch(500)
  .map((data) => {
    // Save up to 500 entries in database (async call)
  });

Obviously that doesn't work as is and doesn't really do anything: highland streams are lazy, so nothing here ever consumes the pipeline. Is something like that even possible, or is there a better way to parse very big CSV files and save the rows to a database (in batches of up to 500)?
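
For reference, a minimal sketch of how this PassThrough approach could be made to run end to end: end the stream with end() rather than writing null, turn each batch into a promise-backed highland stream with flatMap, and consume the pipeline with done(). The transformRow and saveBatch helpers and the file name are hypothetical placeholders, not part of any library:

const csv = require('papaparse');
const fs = require('fs');
const highland = require('highland');
const { PassThrough } = require('stream');

const passThroughStream = new PassThrough({ objectMode: true });
const fileStream = fs.createReadStream('file.csv'); // illustrative path

csv.parse(fileStream, {
  step: (row) => passThroughStream.write(row.data[0]),
  complete: () => passThroughStream.end(), // signals end-of-stream to the highland consumer
});

highland(passThroughStream)
  .map((row) => transformRow(row))              // transformRow: hypothetical per-row transform
  .batch(500)                                   // group rows into arrays of up to 500
  .flatMap((rows) => highland(saveBatch(rows))) // saveBatch: hypothetical, returns a Promise
  .done(() => console.log('All rows saved'));   // done() consumes the stream, starting the flow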

Edit: Using the csv package (https://www.npmjs.com/package/csv) it would be possible like so (the same goes for fast-csv):

const csv = require('csv'); // fast-csv exposes a similar parse() transform stream

highland(fileStream.pipe(csv.parse()))
  .map((data) => {
    // data transform
  })
  .batch(500)
  .map((data) => {
    // Save up to 500 entries in database (async call)
  });

But unfortunately neither NPM package parses the CSV files properly in all cases.

After a quick look at papaparse I decided to implement a CSV parser in scramjet.

const scramjet = require('scramjet');

fileStream.pipe(new scramjet.StringStream('utf-8'))
    .CSVParse(options)                                 // parse CSV rows from the text stream
    .batch(500)                                        // group parsed rows into arrays of up to 500
    .map(items => db.insertArray('some_table', items)) // db.insertArray: placeholder for the async batch insert
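
A caveat: like any Node stream pipeline, the chain above only starts flowing once something consumes it. A minimal sketch, assuming scramjet's run() method (which, as I understand it, consumes the stream and returns a Promise that resolves when it ends):

fileStream.pipe(new scramjet.StringStream('utf-8'))
    .CSVParse(options)
    .batch(500)
    .map(items => db.insertArray('some_table', items))
    .run()                                      // assumed: consumes the stream, returns a Promise
    .then(() => console.log('Import finished'))
    .catch(err => console.error(err));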

I hope that works for you. :)
