In nodeJS , I am trying to read a parquet file (compression='snappy') but not successful.
I used https://github.com/ironSource/parquetjs npm module to open local file and read it but reader.cursor() throws cryptic error ' not yet implemented '. It does not matter which compression (plain, rle, or snappy) was used to create input file, it throws same error.
Here is my code:
const readParquet = async (fileKey) => {
const filePath = 'parquet-test-file.plain'; // 'snappy';
console.log('----- reading file : ', filePath);
let reader = await parquet.ParquetReader.openFile(filePath);
console.log('---- ParquetReader initialized....');
// create a new cursor
let cursor = reader.getCursor();
// read all records from the file and print them
if (cursor) {
console.log('---- cursor initialized....');
let record = await cursor.next() ; // this line throws exception
while (record) {
console.log(record);
record = await cursor.next();
}
}
await reader.close();
console.log('----- done with reading parquet file....');
return;
};
Call to read:
let dt = readParquet(fileKeys.dataFileKey);
dt
.then((value) => console.log('--------SUCCESS', value))
.catch((error) => {
console.log('-------FAILURE ', error); // Random error
console.log(error.stack);
})
More info: 1. I have generated my parquet files in python using pyarrow.parquet 2. I used 'SNAPPY' compression while writing file 3. I can read these files in python without any issue 4. My schema is not fixed (unknown) each time I write parquet file. I do not create schema while writing. 5. error.stack prints undefined in console 6. console.log('-------FAILURE ', error); prints "not yet implemented"
I would like to know if someone has encountered similar problem and has ideas/solution to share. BTW my parquet files are stored on AWS S3 location (unlike in this test code). I still have to find solution to read parquet file from S3 bucket.
Any help, suggestions, code example will be highly appreciated.
Use var AWS = require('aws-sdk');
to get data from S3.
Then use node-parquet
to read parquet file into variable.
import np = require('node-parquet');
// Read from a file:
var reader = new np.ParquetReader(`file.parquet`);
var parquet_info = reader.info();
var parquet_rows = reader.rows();
reader.close();
parquet_rows = parquet_rows + "\n";
There is a fork of https://github.com/ironSource/parquetjs here: https://github.com/ZJONSSON/parquetjs which is a "lite" version of the ironSource project. You can install it using npm install parquetjs-lite
.
The ZJONSSON project comes with a function ParquetReader.openS3
, which accepts an s3 client (from version 2 of the AWS SDK) and params ( {Bucket: 'x', Key: 'y'}
). You might want to try and see if that works for you.
If you are using version 3 of the AWS SDK / S3 client, I have a compatible fork here: https://github.com/entitycs/parquetjs (see tag feature/openS3v3).
Example usage from the project's README.md:
const parquet = require("parquetjs-lite");
const params = {
Bucket: 'xxxxxxxxxxx',
Key: 'xxxxxxxxxxx'
};
// v2 example
const AWS = require('aws-sdk');
const client = new AWS.S3({
accessKeyId: 'xxxxxxxxxxx',
secretAccessKey: 'xxxxxxxxxxx'
});
let reader = await parquet.ParquetReader.openS3(client,params);
//v3 example
const {S3Client, HeadObjectCommand, GetObjectCommand} = require('@aws-sdk/client-s3');
const client = new S3Client({region:"us-east-1"});
let reader = await parquet.ParquetReader.openS3(
{S3Client:client, HeadObjectCommand, GetObjectCommand},
params
);
// create a new cursor
let cursor = reader.getCursor();
// read all records from the file and print them
let record = null;
while (record = await cursor.next()) {
console.log(record);
}
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.