
Not able to get right Sheets from huge xlsx files — using NodeJS XLSX library

I'm trying to get the data from a huge file (800k rows) and put it into a database via Lambda (AWS). To do that, I'm getting the xlsx file from S3 as a buffer and reading it.

const XLSX = require('xlsx'); // SheetJS

module.exports.getSalesData = new Promise((resolve, reject) => {
  // getFileFromS3 downloads the xlsx from S3 and resolves with the parsed workbook
  getFileFromS3(filename)
    .then(function (workbook) {
      console.log(workbook.SheetNames[1]); // 'sales'
      console.log(workbook.SheetNames); // [ 'main', 'sales', 'Sheet1' ] -- 'sales' is listed
      console.log(Array.isArray(workbook.SheetNames)); // true
      console.log(typeof workbook.SheetNames); // 'object'
      console.log(Object.keys(workbook.Sheets)); // [ 'main', 'Sheet1' ] -- why is 'sales' not here?

      var sheet_name = workbook.SheetNames[1]; // sales tab
      var json_sheet = XLSX.utils.sheet_to_json(workbook.Sheets[sheet_name], { raw: true });
      resolve(json_sheet);
    })
    .catch(err => {
      console.log('File: ' + filename + ' doesn\'t exist on S3 or you\'re not connected to the internet.');
      reject(err); // propagate the error so callers' .catch() fires
    });
});
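
getFileFromS3 isn't shown above; for context, here is a minimal sketch of what it could look like, assuming the aws-sdk v2 S3 client, the xlsx (SheetJS) package, and a placeholder bucket name:

const AWS = require('aws-sdk');
const XLSX = require('xlsx');
const s3 = new AWS.S3();

// Hypothetical helper: download the object as a Buffer and let SheetJS parse it.
function getFileFromS3(filename) {
  return s3.getObject({ Bucket: 'my-bucket', Key: filename }) // 'my-bucket' is a placeholder
    .promise()
    .then(obj => XLSX.read(obj.Body, { type: 'buffer' }));
}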

The issue is that for workbook.Sheets I should see [ 'main', 'sales', 'Sheet1' ], right?

Then I try to get the number of rows (already converted to JSON) like this:

getSalesData.then(function (data) {
  console.log(data.length + ' rows'); // prints 0 instead of 800k+
  console.log(data[0]); // undefined
}).catch(err => console.error(err));

Where the parameter data is the json_sheet defined in the function above. So for data.length (the number of rows) I get 0 instead of 800k+. And, of course, data[0] is undefined.
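
As a side note, a quick way to check whether the sheet was parsed with any cells at all is to look at its '!ref' property (SheetJS stores the used cell range there; it's absent on an empty or missing sheet). A small sketch:

var sheet = workbook.Sheets[sheet_name];
// '!ref' is the used range, e.g. 'A1:F800000'; undefined means no cells were parsed
console.log(sheet ? sheet['!ref'] : 'sheet is missing from workbook.Sheets');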

PS: the file is 57.3 MB -- not sure if that's the cause.

Thanks in advance for help.

So basically what was happening is that NodeJS wasn't able to read the full file because parsing it exceeded the NodeJS VM's memory limit for strings.

So what I had to do was increase the memory limit, like this:

node --max-old-space-size=2048 services/process/process-sales.js

This increases the memory available to NodeJS from 512MB to 2048MB (2GB).
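
If you can't easily change how node is launched, the same flag can also be passed through the NODE_OPTIONS environment variable (supported since Node.js 8), which should have the same effect as the command above:

export NODE_OPTIONS="--max-old-space-size=2048"
node services/process/process-sales.js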

But this is just a workaround to read a large number of values.

I don't recommend using NodeJS to handle large amounts of data like this. Instead, go with Python and a library like Pandas, which is great for this kind of work.

PS: just my opinion and experience from dealing with data in NodeJS. I don't think NodeJS was made for it.
