
Not able to get right Sheets from huge xlsx files — using NodeJS XLSX library

I'm trying to get the data from a huge file (800k rows) and put it into a database via Lambda (AWS). To do that, I'm getting the xlsx file from S3 as a buffer and reading it.

const XLSX = require('xlsx'); // SheetJS

module.exports.getSalesData = new Promise((resolve, reject) => {
  // getFileFromS3 downloads the xlsx from S3 and resolves with the parsed workbook
  getFileFromS3(filename)
    .then(function (workbook) {
      console.log(workbook.SheetNames[1]); // 'sales'
      console.log(workbook.SheetNames); // [ 'main', 'sales', 'Sheet1' ] -- 'sales' is listed
      console.log(Array.isArray(workbook.SheetNames)); // true
      console.log(typeof workbook.SheetNames); // 'object'
      console.log(Object.keys(workbook.Sheets)); // [ 'main', 'Sheet1' ] -- why is 'sales' not here?

      var sheet_name = workbook.SheetNames[1]; // sales tab
      var json_sheet = XLSX.utils.sheet_to_json(workbook.Sheets[sheet_name], { raw: true });
      resolve(json_sheet);
    })
    .catch(err => {
      console.log('File: ' + filename + ' doesn\'t exist on S3 or you\'re not connected to the internet.');
      reject(err); // propagate the error so callers' .catch() fires
    });
});
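
getFileFromS3 isn't shown above; for context, here is a minimal sketch of what it could look like, assuming the aws-sdk v2 S3 client, the xlsx (SheetJS) package, and a placeholder bucket name:

const AWS = require('aws-sdk');
const XLSX = require('xlsx');
const s3 = new AWS.S3();

// Hypothetical helper: download the object as a Buffer and let SheetJS parse it.
function getFileFromS3(filename) {
  return s3.getObject({ Bucket: 'my-bucket', Key: filename }) // 'my-bucket' is a placeholder
    .promise()
    .then(obj => XLSX.read(obj.Body, { type: 'buffer' }));
}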

The issue is that for workbook.Sheets I should see [ 'main', 'sales', 'Sheet1' ], right?

Then I try to get the number of rows (already converted to JSON) like this:

getSalesData.then(function (data) {
  console.log(data.length + ' rows'); // prints 0 instead of 800k+
  console.log(data[0]); // undefined
}).catch(err => console.error(err));

Where the parameter data is the json_sheet defined in the function above. So for data.length (the number of rows) I get 0 instead of 800k+. And, of course, data[0] is undefined.
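
As a side note, a quick way to check whether the sheet was parsed with any cells at all is to look at its '!ref' property (SheetJS stores the used cell range there; it's absent on an empty or missing sheet). A small sketch:

var sheet = workbook.Sheets[sheet_name];
// '!ref' is the used range, e.g. 'A1:F800000'; undefined means no cells were parsed
console.log(sheet ? sheet['!ref'] : 'sheet is missing from workbook.Sheets');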

PS: the file is 57.3 MB -- not sure if that's the cause.

Thanks in advance for help.

So basically what was happening is that NodeJS wasn't able to read the full file because parsing it exceeded the NodeJS VM's memory limit for strings.

So what I had to do was increase the memory limit, like this:

node --max-old-space-size=2048 services/process/process-sales.js

This increases the memory available to NodeJS from 512MB to 2048MB (2GB).
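
If you can't easily change how node is launched, the same flag can also be passed through the NODE_OPTIONS environment variable (supported since Node.js 8), which should have the same effect as the command above:

export NODE_OPTIONS="--max-old-space-size=2048"
node services/process/process-sales.js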

But this is just a workaround to read a large number of values.

I don't recommend using NodeJS to handle large amounts of data like this. Instead, go with Python and a library like Pandas, which is great for this kind of work.

PS: just my opinion and experience from dealing with data in NodeJS. I don't think NodeJS was made for it.
