
How to generate a Parquet file with a large amount of data using Java and upload it to an AWS S3 bucket

I'm using the setup described on the page "How to Generate Parquet File Using Pure Java (Including Date & Decimal Types) And Upload to S3 [Windows] (No HDFS)":

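    // Needs: org.apache.avro.generic.GenericData, org.apache.hadoop.conf.Configuration,
    // org.apache.hadoop.fs.Path, org.apache.parquet.avro.AvroParquetWriter,
    // org.apache.parquet.hadoop.ParquetWriter, java.io.IOException, java.util.List
    public void writeToParquet(List<GenericData.Record> recordsToWrite, String fileToWrite) throws IOException {
        Configuration conf = new Configuration();
        conf.set("fs.s3.awsAccessKeyId", "<access_key>");
        conf.set("fs.s3.awsSecretAccessKey", "<secret_key>");

        Path path = new Path(fileToWrite); // fileToWrite = "s3://bucket/folder/data.parquet"
        try (ParquetWriter<GenericData.Record> writer = AvroParquetWriter
                .<GenericData.Record>builder(path)
                .withSchema(SCHEMA) // assumed: the Avro schema of the records, defined elsewhere
                .withConf(conf)
                .withRowGroupSize(16 * 1024 * 1024)
                .withPageSize(4 * 1024 * 1024)
                .build()) {
            for (GenericData.Record record : recordsToWrite) {
                writer.write(record);
            }
        } catch (Exception ex) {
            // The exception is only logged here, so a failed write can go unnoticed by the caller
            LOGGER.info("ParquetWriter Exception " + ex);
        }
    }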

I use the same library versions as mentioned by @Sal in that post. With a small file of around 5 records, all of them are converted fine, but I have a large set of about 800k records (source file size 5 GB+) that I need to convert to Parquet.
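The method above receives a fully materialized List, so a 5 GB source has to fit in memory before the first record is written. A minimal sketch of a streaming alternative follows, under two assumptions that are not in the original post: the source is line-oriented (one record per line), and `RecordSink` is a hypothetical stand-in for the call to `ParquetWriter.write`. It also returns the number of records forwarded, which can be compared against the expected count.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

class StreamingConvert {

    /** Hypothetical stand-in for ParquetWriter#write; not part of any library. */
    interface RecordSink {
        void write(String record) throws IOException;
    }

    /** Reads the source one line at a time, forwards each line, and returns how many were forwarded. */
    static long convert(Path source, RecordSink sink) throws IOException {
        long count = 0;
        try (BufferedReader reader = Files.newBufferedReader(source, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                sink.write(line); // only the current line is held in memory
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Small temporary file standing in for the real 5 GB source
        Path source = Files.createTempFile("records", ".txt");
        Files.write(source, Arrays.asList("r1", "r2", "r3"), StandardCharsets.UTF_8);

        // Real code would build a GenericData.Record from the line and pass it to the Parquet writer
        long written = convert(source, record -> { });
        System.out.println("records written: " + written);
    }
}
```

If the returned count is far below the number of source records, the loop stopped early, which is worth checking before looking at the upload step.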

Issue 1: When I write the file to a local drive and upload it separately, the output contains barely 10 records, with an output file size of roughly 5 MB.

Issue 2: When I try to upload directly to S3 as shown above, I hit a weird issue: after the first run I always get this exception:

java.io.IOException: File already exists: s3://mybucket/output/folder/path/myfile.parquet

Interestingly, the file is not present/visible at that path, yet I still get this error.
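For context, and as far as I know, the Parquet writer builder refuses to replace an existing path unless `ParquetFileWriter.Mode.OVERWRITE` is requested through `withWriteMode`, so anything a previous run left behind at that path can trigger this error. The same create-only versus overwrite distinction can be illustrated with plain `java.nio` on a local file. This is only an analogy: the class and method names below are hypothetical, and no Hadoop or S3 code is involved.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class WriteModeDemo {

    /** Writes data to target; returns false when create-only mode refuses to replace an existing file. */
    static boolean write(Path target, String data, boolean overwrite) throws IOException {
        StandardOpenOption[] options = overwrite
                ? new StandardOpenOption[] {
                        StandardOpenOption.CREATE,
                        StandardOpenOption.TRUNCATE_EXISTING,
                        StandardOpenOption.WRITE}
                : new StandardOpenOption[] {
                        StandardOpenOption.CREATE_NEW, // fails if the file already exists
                        StandardOpenOption.WRITE};
        try (OutputStream out = Files.newOutputStream(target, options)) {
            out.write(data.getBytes(StandardCharsets.UTF_8));
            return true;
        } catch (FileAlreadyExistsException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        Path target = Files.createTempDirectory("write-mode-demo").resolve("data.bin");
        System.out.println(write(target, "run 1", false)); // true: file did not exist yet
        System.out.println(write(target, "run 2", false)); // false: create-only mode, file exists
        System.out.println(write(target, "run 3", true));  // true: overwrite mode replaces it
    }
}
```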

Issue 3: I'm also facing the exception below:

java.lang.NoSuchFieldError: workaroundNonThreadSafePasswdCalls
    at org.apache.hadoop.io.nativeio.NativeIO.initNative(Native Method)
    at org.apache.hadoop.io.nativeio.NativeIO.<clinit>(NativeIO.java:89)
    at org.apache.hadoop.fs.FileUtil.setPermission(FileUtil.java:655)
    at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:514)
    at org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:290)
    at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:385)
    at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:364)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:555)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:536)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:443)
    at org.apache.parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:244)
    at org.apache.parquet.hadoop.ParquetWriter.<init>(ParquetWriter.java:273)
    at org.apache.parquet.hadoop.ParquetWriter$Builder.build(ParquetWriter.java:494)
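A NoSuchFieldError raised inside `NativeIO.initNative` often points to a version mismatch, for example Hadoop jars from different releases on the classpath, or a native library (`hadoop.dll` / `libhadoop.so`) built for a different Hadoop version than the jars. That diagnosis is an assumption here, not something the trace proves. A small sketch that prints where the classes from the trace are loaded from can help confirm or rule it out. The class name `JarLocator` is hypothetical, and only JDK calls are used.

```java
import java.security.CodeSource;

class JarLocator {

    /** Returns the jar or directory a class is loaded from, without running its static initializer. */
    static String locationOf(String className) {
        try {
            // initialize=false matters: NativeIO's static initializer is what throws in the trace above
            Class<?> cls = Class.forName(className, false, JarLocator.class.getClassLoader());
            CodeSource source = cls.getProtectionDomain().getCodeSource();
            return source == null ? "<bootstrap or unknown>" : String.valueOf(source.getLocation());
        } catch (ClassNotFoundException | LinkageError e) {
            return "<not loadable: " + e + ">";
        }
    }

    public static void main(String[] args) {
        // Class names taken from the stack trace
        String[] names = {
            "org.apache.hadoop.io.nativeio.NativeIO",
            "org.apache.hadoop.fs.FileSystem",
            "org.apache.parquet.hadoop.ParquetWriter"
        };
        for (String name : names) {
            System.out.println(name + " -> " + locationOf(name));
        }
        // Directories searched for native libraries
        System.out.println("java.library.path = " + System.getProperty("java.library.path"));
    }
}
```

Run it with the same classpath as the failing program. If the Hadoop classes resolve to jars of different versions, aligning them is a reasonable first thing to try.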

Kindly help. Thanks in advance.

I was able to fix the java.io.IOException: File already exists:... error by adding

writer = AvroParquetWriter
