Read Parquet file using Apache Beam Java SDK without providing schema

It seems that the org.apache.beam.sdk.io.parquet.ParquetIO.readFiles method requires a schema to be passed in.

  • Is there a way to avoid the need to pass in the schema?
  • Isn't the schema included in the Parquet file?
  • What if I am trying to read multiple Parquet files with different schemas?

Please find my responses inline:

  • Is there a way to avoid the need to pass in the schema? Currently, there is no mechanism to avoid passing the schema of the Parquet files.

  • Isn't the schema included in the Parquet file? Yes, that is correct: the metadata in the file footer has the schema definition of the file. Please refer to BEAM-8344, which is an open feature request to support schema inference.

  • What if I am trying to read multiple Parquet files with different schemas? You can do something like the following, passing file patterns or paths and specifying a different schema for each read.

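  // Match the input files, then open each match as a ReadableFile.
  PCollection<FileIO.ReadableFile> files = pipeline
    .apply(FileIO.match().filepattern(options.getInputFilepattern()))
    .apply(FileIO.readMatches());

  // SCHEMA is the Avro schema of the matched files.
  PCollection<GenericRecord> output = files.apply(ParquetIO.readFiles(SCHEMA));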
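
To illustrate the point that the schema travels with the file: a plain (unencrypted) Parquet file ends with the serialized file metadata, which includes the schema, followed by a 4-byte little-endian metadata length and the ASCII magic "PAR1". The following is a minimal sketch that uses only the Java standard library to locate that metadata; it does not decode the schema itself, and the class and method names (ParquetFooterProbe, footerLengthFromTail, footerLengthOfFile) are made up for this example and are not part of Beam or Parquet.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for illustration only; not a Beam or Parquet API.
final class ParquetFooterProbe {
    private static final byte[] MAGIC = "PAR1".getBytes(StandardCharsets.US_ASCII);

    /** Given the final 8 bytes of a Parquet file, returns the metadata length in bytes. */
    static int footerLengthFromTail(byte[] tail) {
        if (tail.length != 8) {
            throw new IllegalArgumentException("expected exactly the last 8 bytes of the file");
        }
        // Bytes 4..7 must be the trailing magic of an unencrypted Parquet file.
        if (!Arrays.equals(Arrays.copyOfRange(tail, 4, 8), MAGIC)) {
            throw new IllegalArgumentException("trailing PAR1 magic not found");
        }
        // Bytes 0..3 hold the metadata length as a little-endian 32-bit integer.
        return ByteBuffer.wrap(tail, 0, 4).order(ByteOrder.LITTLE_ENDIAN).getInt();
    }

    /** Reads the last 8 bytes of the file at the given path and returns the metadata length. */
    static int footerLengthOfFile(String path) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            // Smallest possible layout: leading magic (4) + length (4) + trailing magic (4).
            if (file.length() < 12) {
                throw new IllegalArgumentException("file is too small to be a Parquet file");
            }
            byte[] tail = new byte[8];
            file.seek(file.length() - 8);
            file.readFully(tail);
            return footerLengthFromTail(tail);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: ParquetFooterProbe <path-to-parquet-file>");
            return;
        }
        System.out.println("footer metadata bytes: " + footerLengthOfFile(args[0]));
    }
}
```

Turning those metadata bytes into a usable schema still needs a Parquet library, which is why ParquetIO.readFiles in the snippet above is given the schema explicitly.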
