It seems that the org.apache.beam.sdk.io.parquet.ParquetIO.readFiles
method requires a schema to be passed in.
Please find my responses inline.
Is there a way to avoid the need to pass in the schema? Currently, there is no mechanism to avoid passing the schema of the Parquet files.
Isn't the schema included in the Parquet file? Yes, that is correct: the file's metadata (which Parquet stores in the file footer) contains the schema definition. Please refer to BEAM-8344, an open feature request to support schema inference.
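To make the point about the schema living in the file itself a bit more concrete, here is a minimal sketch in plain Java (no Beam or Parquet library involved) that finds the size of the metadata block in a Parquet file. It relies on the Parquet file layout: the file starts and ends with the 4-byte magic `PAR1`, and the 4 bytes before the trailing magic hold the little-endian length of the serialized file metadata, which is where the schema is stored. The class `ParquetFooter` and the method `metadataLength` are hypothetical names made up for this illustration; the sketch does not decode the schema and does not handle files with an encrypted footer.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of Beam or parquet-mr.
final class ParquetFooter {
    private static final byte[] MAGIC = "PAR1".getBytes(StandardCharsets.US_ASCII);

    private ParquetFooter() {}

    /**
     * Returns the byte length of the serialized file metadata (which contains
     * the schema), read from the last 8 bytes of a Parquet file's contents.
     */
    static int metadataLength(byte[] file) {
        int n = file.length;
        // Smallest possible layout: leading magic + footer length + trailing magic.
        if (n < 12) {
            throw new IllegalArgumentException("too short to be a Parquet file");
        }
        if (!Arrays.equals(Arrays.copyOfRange(file, 0, 4), MAGIC)
                || !Arrays.equals(Arrays.copyOfRange(file, n - 4, n), MAGIC)) {
            throw new IllegalArgumentException("missing PAR1 magic bytes");
        }
        // The 4 bytes before the trailing magic are the little-endian metadata length.
        int len = ByteBuffer.wrap(file, n - 8, 4).order(ByteOrder.LITTLE_ENDIAN).getInt();
        if (len < 0 || len > n - 12) {
            throw new IllegalArgumentException("invalid footer length");
        }
        return len;
    }
}
```

The metadata itself occupies the `len` bytes immediately before those last 8 bytes. Turning it into a usable schema requires a Parquet library, which is the kind of work the feature request above would have ParquetIO do automatically.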
What if I am trying to read multiple Parquet files with different schemas? You can do something like the following, passing file patterns or paths and specifying a different schema for each set of files.
    // Match the input files by pattern, then read them using the Avro schema they share.
    PCollection<FileIO.ReadableFile> files = pipeline
        .apply(FileIO.match().filepattern(options.getInputFilepattern()))
        .apply(FileIO.readMatches());
    PCollection<GenericRecord> output = files.apply(ParquetIO.readFiles(SCHEMA));