How can I write NULL value to parquet using org.apache.parquet.hadoop.ParquetWriter?

I have a tool that uses an org.apache.parquet.hadoop.ParquetWriter to convert CSV data files to Parquet data files.

I can write basic primitive types just fine (INT32, DOUBLE, BINARY string).

I need to write NULL values, but I do not know how. I've tried simply writing null with ParquetWriter, and it throws an exception.

How can I write NULL using org.apache.parquet.hadoop.ParquetWriter? Is there a nullable type?

I believe the code is self-explanatory:

    ArrayList<Type> fields = new ArrayList<>();
    fields.add(new PrimitiveType(Type.Repetition.OPTIONAL, PrimitiveTypeName.INT32, "int32_col", null));
    fields.add(new PrimitiveType(Type.Repetition.OPTIONAL, PrimitiveTypeName.DOUBLE, "double_col", null));
    fields.add(new PrimitiveType(Type.Repetition.OPTIONAL, PrimitiveTypeName.BINARY, "string_col", null));
    MessageType schema = new MessageType("input", fields);

    Configuration configuration = new Configuration();
    configuration.setQuietMode(true);
    GroupWriteSupport.setSchema(schema, configuration);
    SimpleGroupFactory f = new SimpleGroupFactory(schema);
    ParquetWriter<Group> writer = new ParquetWriter<Group>(
      new Path("output.parquet"),
      new GroupWriteSupport(),
      CompressionCodecName.SNAPPY,
      ParquetWriter.DEFAULT_BLOCK_SIZE,
      ParquetWriter.DEFAULT_PAGE_SIZE,
      1048576,
      true,
      false,
      ParquetProperties.WriterVersion.PARQUET_1_0,
      configuration
    );

    // create row 1 with defined values
    Group group1 = f.newGroup();
    Integer int1 = 100;
    Double double1 = 0.5;
    String string1 = "string-value";
    group1.add(0, int1);
    group1.add(1, double1);
    group1.add(2, string1);
    writer.write(group1);

    // create row 2 with NULL values -- does not work!
    Group group2 = f.newGroup();
    Integer int2 = null;
    Double double2 = null;
    String string2 = null;
    group2.add(0, int2); // <-- throws NullPointerException
    group2.add(1, double2); // <-- throws NullPointerException
    group2.add(2, string2); // <-- throws NullPointerException
    writer.write(group2);

    writer.close();

The solution turns out to be quite simple: just don't write a value. Because the fields are declared OPTIONAL, a field that is never added to the group is written as NULL.

    // create row 1 with defined values
    Group group1 = f.newGroup();
    Integer int1 = 100;
    Double double1 = 0.5;
    String string1 = "string-value";
    group1.add(0, int1);
    group1.add(1, double1);
    group1.add(2, string1);
    writer.write(group1);

    // create row 2 with NULL values: add nothing to the group
    Group group2 = f.newGroup();
    // do nothing!
    writer.write(group2);

    // closing the writer flushes the data and writes the file footer
    writer.close();

    // The Parquet file now has 2 rows: one with values, one with NULL values
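
For a CSV converter, the same idea can be wrapped in a small helper that parses each cell and only calls the adder when a value is present. The following is a minimal sketch using only the JDK. The class `NullableCells` and its methods are hypothetical names, not part of Parquet, and treating an empty cell as NULL is an assumption about the CSV input. A plain list stands in for the `Group` so the example runs without Parquet on the classpath.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper class; none of these names come from the Parquet API.
class NullableCells {

    // Parses a CSV cell as an Integer. Assumption: an empty or blank cell means NULL.
    static Integer parseIntCell(String cell) {
        if (cell == null || cell.trim().isEmpty()) {
            return null;
        }
        return Integer.valueOf(cell.trim());
    }

    // Parses a CSV cell as a Double, with the same empty-means-NULL assumption.
    static Double parseDoubleCell(String cell) {
        if (cell == null || cell.trim().isEmpty()) {
            return null;
        }
        return Double.valueOf(cell.trim());
    }

    // Calls the adder only when the value is non-null.
    // Returns true if the value was added, false if it was skipped.
    static <T> boolean addIfPresent(T value, Consumer<T> adder) {
        if (value == null) {
            return false;
        }
        adder.accept(value);
        return true;
    }

    public static void main(String[] args) {
        // Stand-in for a Group: records which fields would have been added.
        List<String> added = new ArrayList<>();

        // The limit -1 keeps empty cells instead of dropping them.
        String[] cells = "100,,string-value".split(",", -1);

        addIfPresent(parseIntCell(cells[0]), v -> added.add("int32_col=" + v));
        addIfPresent(parseDoubleCell(cells[1]), v -> added.add("double_col=" + v));
        addIfPresent(cells[2].isEmpty() ? null : cells[2], v -> added.add("string_col=" + v));

        System.out.println(added); // prints [int32_col=100, string_col=string-value]
    }
}
```

In the converter itself, the adder would presumably be the `Group.add` call from the question, for example `addIfPresent(int2, v -> group2.add(0, v));`, so a row with an empty cell simply never receives a value for that field.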
