Javascript - Read parquet data (with snappy compression) from AWS s3 bucket

In Node.js, I am trying to read a parquet file (compression='snappy') but without success.

I used the https://github.com/ironSource/parquetjs npm module to open a local file and read it, but reading from the cursor (cursor.next()) throws the cryptic error 'not yet implemented'. It does not matter which compression (plain, rle, or snappy) was used to create the input file; it throws the same error.

Here is my code:

const parquet = require('parquetjs'); // the ironSource module mentioned above

const readParquet = async (fileKey) => {

  const filePath = 'parquet-test-file.plain'; // 'snappy';

  console.log('----- reading file : ', filePath);
  let reader = await parquet.ParquetReader.openFile(filePath);
  console.log('---- ParquetReader initialized....');

  // create a new cursor
  let cursor = reader.getCursor();

  // read all records from the file and print them
  if (cursor) {
    console.log('---- cursor initialized....');

    let record = await cursor.next() ; // this line throws exception
    while (record) {
      console.log(record);
      record = await cursor.next();
    }
  }

  await reader.close();
  console.log('----- done with reading parquet file....');

  return;
};

Call to read:

let dt = readParquet(fileKeys.dataFileKey);
dt
  .then((value) => console.log('--------SUCCESS', value))
  .catch((error) => {
    console.log('-------FAILURE ', error); // Random error
    console.log(error.stack);
  })

More info:
1. I generated my parquet files in Python using pyarrow.parquet.
2. I used 'SNAPPY' compression while writing the files.
3. I can read these files in Python without any issue.
4. My schema is not fixed (unknown) each time I write a parquet file. I do not create a schema while writing.
5. error.stack prints undefined in the console.
6. console.log('-------FAILURE ', error); prints "not yet implemented".
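Point 5 suggests that the rejected value is not an Error object: a thrown string has no stack property, which would explain why error.stack prints undefined. A minimal sketch, assuming that is what happens here (toError and failsWithString are hypothetical names, not part of parquetjs), that wraps such values so the catch handler always has a message and a stack:

```javascript
// Hypothetical helper: wrap non-Error rejection values (e.g. thrown strings)
// in an Error so that .message and .stack are always available.
function toError(reason) {
  return reason instanceof Error ? reason : new Error(String(reason));
}

// Simulates a library that throws a bare string instead of an Error.
async function failsWithString() {
  throw 'not yet implemented';
}

failsWithString().catch((reason) => {
  const error = toError(reason);
  console.log('-------FAILURE ', error.message);
  console.log(typeof reason.stack); // a string primitive has no stack
  console.log(typeof error.stack);  // the wrapped Error does
});
```

The stack recorded this way points at the place where the value was wrapped, not at the original throw site, so it only narrows the search down to the failing call.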

I would like to know if someone has encountered a similar problem and has ideas or a solution to share. BTW my parquet files are stored in an AWS S3 location (unlike in this test code). I still have to find a solution to read a parquet file from an S3 bucket.

Any help, suggestions, or code examples will be highly appreciated.

Use var AWS = require('aws-sdk'); to get data from S3.

Then use node-parquet to read the parquet file into a variable.

const np = require('node-parquet');

// Read from a file:
var reader = new np.ParquetReader(`file.parquet`);
var parquet_info = reader.info();
var parquet_rows = reader.rows();
reader.close();
parquet_rows = parquet_rows + "\n";
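The snippet above reads a local file, so the S3 object has to be downloaded first. A minimal sketch of that step, assuming an AWS SDK v2 client whose getObject(params).promise() resolves to an object with a Buffer Body; downloadToTempFile is a hypothetical helper name, and the bucket and key in the usage comment are placeholders:

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper: download one S3 object to a temporary local file.
// `s3` is assumed to be an AWS SDK v2 client (new AWS.S3()), whose
// getObject(params).promise() resolves to an object with a Buffer `Body`.
async function downloadToTempFile(s3, params) {
  const { Body } = await s3.getObject(params).promise();
  const localPath = path.join(os.tmpdir(), path.basename(params.Key));
  await fs.promises.writeFile(localPath, Body);
  return localPath;
}

// Usage, inside an async function (bucket and key are placeholders):
// const AWS = require('aws-sdk');
// const localPath = await downloadToTempFile(new AWS.S3(), { Bucket: 'x', Key: 'y.parquet' });
// const reader = new np.ParquetReader(localPath);
```

Passing the client in as a parameter keeps the helper independent of how credentials and region are configured. The whole object is buffered in memory before it is written, which may not suit very large files.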

There is a fork of https://github.com/ironSource/parquetjs here: https://github.com/ZJONSSON/parquetjs which is a "lite" version of the ironSource project. You can install it using npm install parquetjs-lite.

The ZJONSSON project comes with a function ParquetReader.openS3, which accepts an S3 client (from version 2 of the AWS SDK) and params ({Bucket: 'x', Key: 'y'}). You might want to try it and see if that works for you.

If you are using version 3 of the AWS SDK / S3 client, I have a compatible fork here: https://github.com/entitycs/parquetjs (see tag feature/openS3v3).

Example usage from the project's README.md:

const parquet = require("parquetjs-lite");

const params = {
  Bucket: 'xxxxxxxxxxx',
  Key: 'xxxxxxxxxxx'
};
// v2 example (use either this or the v3 example below, not both)
const AWS = require('aws-sdk');
const client = new AWS.S3({
  accessKeyId: 'xxxxxxxxxxx',
  secretAccessKey: 'xxxxxxxxxxx'
});
let reader = await parquet.ParquetReader.openS3(client,params);

//v3 example
const {S3Client, HeadObjectCommand, GetObjectCommand} = require('@aws-sdk/client-s3');
const client = new S3Client({region:"us-east-1"});
let reader = await parquet.ParquetReader.openS3(
  {S3Client:client, HeadObjectCommand, GetObjectCommand},
  params
);

// create a new cursor
let cursor = reader.getCursor();

// read all records from the file and print them
let record = null;
while (record = await cursor.next()) {
  console.log(record);
}
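The loop above never closes the reader. A minimal sketch, assuming only the getCursor(), cursor.next() and close() calls already used on this page (readAllRecords is a hypothetical helper name), that collects the records and closes the reader even if reading fails:

```javascript
// Hypothetical helper: collect every record from an opened reader and
// close it even if reading fails. `reader` is assumed to expose
// getCursor() and close() as in the examples above.
async function readAllRecords(reader) {
  const records = [];
  try {
    const cursor = reader.getCursor();
    let record = null;
    // cursor.next() resolves to a falsy value once the file is exhausted
    while ((record = await cursor.next())) {
      records.push(record);
    }
  } finally {
    await reader.close();
  }
  return records;
}
```

This keeps every record in memory, so for large files it is probably better to process each record inside the loop instead of collecting them.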
