Read Parquet file using Apache Beam Java SDK without providing schema

Question

It seems that the org.apache.beam.sdk.io.parquet.ParquetIO.readFiles method requires a schema to be passed in.

Is there a way to avoid the need to pass in the schema?
Isn't the schema included in the Parquet file?
What if I am trying to read multiple Parquet files with different schema?

Answer 1

Please find my responses inline:

Is there a way to avoid the need to pass in the schema? Currently, there is no mechanism to avoid passing the schema of the Parquet files.
Isn't the schema included in the Parquet file? Yes, that is correct: the file metadata (stored in the footer, not the header) contains the schema definition of the file. Please refer to BEAM-8344, an open feature request to support schema inference.
What if I am trying to read multiple Parquet files with different schema? You can do something like the following, where you pass file patterns and paths and specify a different schema for each set of files.

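  // Match the input files by pattern and open each match as a ReadableFile.
  PCollection<FileIO.ReadableFile> files = pipeline
    .apply(FileIO.match().filepattern(options.getInputFilepattern()))
    .apply(FileIO.readMatches());

  // SCHEMA is the Avro schema of the matched files.
  PCollection<GenericRecord> output = files.apply(ParquetIO.readFiles(SCHEMA));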
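
On the point that the schema travels with the file: an unencrypted Parquet file ends with the serialized file metadata (which includes the schema), followed by a 4-byte little-endian metadata length and the magic bytes PAR1. The following is only a rough sketch using the Java standard library to locate that footer. The class name ParquetFooterProbe and the method footerLength are made up for illustration, and the sketch does not decode the metadata itself; for that you would still need a Parquet library.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Beam or Parquet.
public class ParquetFooterProbe {
    private static final byte[] MAGIC = "PAR1".getBytes(StandardCharsets.US_ASCII);

    // Returns the size in bytes of the serialized file metadata,
    // which is where the schema is stored.
    // Assumes an unencrypted Parquet file.
    public static int footerLength(String path) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            long size = file.length();
            // Smallest possible layout: leading magic + length + trailing magic.
            if (size < 12) {
                throw new IOException("Too small to be a Parquet file: " + path);
            }
            // The last 8 bytes are: 4-byte metadata length, then "PAR1".
            byte[] tail = new byte[8];
            file.seek(size - 8);
            file.readFully(tail);
            for (int i = 0; i < MAGIC.length; i++) {
                if (tail[4 + i] != MAGIC[i]) {
                    throw new IOException("Missing PAR1 trailer: " + path);
                }
            }
            return ByteBuffer.wrap(tail, 0, 4).order(ByteOrder.LITTLE_ENDIAN).getInt();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.out.println("usage: ParquetFooterProbe <file.parquet>");
            return;
        }
        System.out.println("File metadata (schema included) occupies "
            + footerLength(args[0]) + " bytes");
    }
}
```

This only shows where the schema lives in the file; ParquetIO.readFiles still expects the schema to be passed in, as described above.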
