简体   繁体   中英

how to write parquet files in java with apache arrow

I am trying to write data in java into apache parquet. So far, what i've done is use apache arrow via the examples here: https://arrow.apache.org/cookbook/java/schema.html#creating-fields and create an arrow format dataset.

Question is, how do I write it into parquet after that? Also, do I need to use apache arrow to output the data as a parquet file? or can I use apache parquet directly to serialize the data and then output it as a parquet file?

what i've done:

try (BufferAllocator allocator = new RootAllocator()) {
    Field name = new Field("name", FieldType.nullable(new ArrowType.Utf8()), null);
    Field age = new Field("age", FieldType.nullable(new ArrowType.Int(32, true)), null);
    Schema schemaPerson = new Schema(asList(name, age));
    try(
        VectorSchemaRoot vectorSchemaRoot = VectorSchemaRoot.create(schemaPerson, allocator)
    ){
        VarCharVector nameVector = (VarCharVector) vectorSchemaRoot.getVector("name");
        nameVector.allocateNew(3);
        nameVector.set(0, "David".getBytes());
        nameVector.set(1, "Gladis".getBytes());
        nameVector.set(2, "Juan".getBytes());
        IntVector ageVector = (IntVector) vectorSchemaRoot.getVector("age");
        ageVector.allocateNew(3);
        ageVector.set(0, 10);
        ageVector.set(1, 20);
        ageVector.set(2, 30);
        vectorSchemaRoot.setRowCount(3);
        File file = new File("randon_access_to_file.arrow");
        try (
            FileOutputStream fileOutputStream = new FileOutputStream(file);
            ArrowFileWriter writer = new ArrowFileWriter(vectorSchemaRoot, null, fileOutputStream.getChannel())
        ) {
            writer.start();
            writer.writeBatch();
            writer.end();
            System.out.println("Record batches written: " + writer.getRecordBlocks().size() + ". Number of rows written: " + vectorSchemaRoot.getRowCount());
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

but this outputs as an arrow file. not a parquet. Any ideas how I can output this to parquet file instead? And do i need arrow to generate a parquet file to begin with - or can i just use parquet directly?

Arrow Java does not yet support writing to Parquet files, but you can use Parquet to do that.

There is some code in the Arrow dataset test classes that may help. See

org.apache.arrow.dataset.ParquetWriteSupport;
org.apache.arrow.dataset.file.TestFileSystemDataset; 

The second class has some tests that use the utilities in the first one.

You can find them on GitHub here: https://github.com/apache/arrow/tree/master/java/dataset/src/test/java/org/apache/arrow/dataset

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM