Read Parquet file using Apache Beam Java SDK without providing schema
It seems that the org.apache.beam.sdk.io.parquet.ParquetIO.readFiles method requires a schema to be passed in.
Please find my responses inline.
Is there a way to avoid the need to pass in the schema?
Currently there is no mechanism to avoid passing the schema of the Parquet files.
Isn't the schema included in the Parquet file?
Yes, that is correct: the file's own metadata (which Parquet stores in the file footer) contains the schema definition of the file. ParquetIO does not read it for you, though. Please refer to BEAM-8344, an open feature request to support schema inference.
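To illustrate that the schema really does travel with the file, here is a minimal sketch that uses only the JDK. It relies on the Parquet file layout, in which a plain (unencrypted) file ends with a 4-byte little-endian footer length followed by the magic bytes PAR1, and the footer metadata in front of that tail carries the schema. The class and method names (ParquetFooterProbe, footerLength) are hypothetical and not part of Beam or Parquet; the sketch only locates the footer and does not decode the schema itself.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Beam or Parquet.
public class ParquetFooterProbe {
  private static final byte[] MAGIC = "PAR1".getBytes(StandardCharsets.US_ASCII);

  /** Parses the last 8 bytes of a file and returns the footer metadata length in bytes. */
  static int footerLength(byte[] tail) {
    if (tail.length != 8) {
      throw new IllegalArgumentException("expected the last 8 bytes of the file");
    }
    // Bytes 4..7 must be the trailing magic "PAR1".
    for (int i = 0; i < 4; i++) {
      if (tail[4 + i] != MAGIC[i]) {
        throw new IllegalArgumentException("missing PAR1 trailer, not a plain Parquet file");
      }
    }
    // Bytes 0..3 are the footer length, little-endian.
    return ByteBuffer.wrap(tail, 0, 4).order(ByteOrder.LITTLE_ENDIAN).getInt();
  }

  /** Reads the 8-byte tail of the file at the given local path and parses it. */
  static int footerLength(String path) throws IOException {
    try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
      // Smallest possible layout: leading magic + footer length + trailing magic.
      if (file.length() < 12) {
        throw new IOException("file too short to be a Parquet file");
      }
      byte[] tail = new byte[8];
      file.seek(file.length() - 8);
      file.readFully(tail);
      return footerLength(tail);
    }
  }

  public static void main(String[] args) throws IOException {
    if (args.length != 1) {
      System.err.println("usage: ParquetFooterProbe <local-parquet-file>");
      return;
    }
    System.out.println("Footer metadata bytes: " + footerLength(args[0]));
  }
}
```

Decoding the footer into an actual schema requires the Parquet libraries, and until BEAM-8344 is resolved that step has to happen outside ParquetIO.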
What if I am trying to read multiple Parquet files with different schemas?
You can do something like the following, where you pass file patterns or paths and specify the schema that corresponds to each.
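// Match the input files, then read them with an explicit Avro schema.
PCollection<FileIO.ReadableFile> files = pipeline
    .apply(FileIO.match().filepattern(options.getInputFilepattern()))
    .apply(FileIO.readMatches());
// SCHEMA is the org.apache.avro.Schema that describes the records in these files.
PCollection<GenericRecord> output = files.apply(ParquetIO.readFiles(SCHEMA));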
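For several groups of files with different schemas, one possible arrangement is to apply a separate read per file pattern. The sketch below assumes you already have one Avro schema for each pattern. The class MultiSchemaParquetRead, the method readAll, and the schemaByPattern map are hypothetical names chosen for illustration; ParquetIO.read(schema).from(pattern) is the Beam call doing the work.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.parquet.ParquetIO;
import org.apache.beam.sdk.values.PCollection;

// Hypothetical helper, not part of Beam.
public class MultiSchemaParquetRead {

  /**
   * Applies one ParquetIO read per file pattern, each with its own schema.
   * The result maps every pattern to the PCollection of records read from it.
   */
  static Map<String, PCollection<GenericRecord>> readAll(
      Pipeline pipeline, Map<String, Schema> schemaByPattern) {
    Map<String, PCollection<GenericRecord>> recordsByPattern = new LinkedHashMap<>();
    for (Map.Entry<String, Schema> entry : schemaByPattern.entrySet()) {
      String pattern = entry.getKey();
      // Transform names must be unique within a pipeline, so the pattern is part of the name.
      PCollection<GenericRecord> records =
          pipeline.apply("ReadParquet " + pattern, ParquetIO.read(entry.getValue()).from(pattern));
      recordsByPattern.put(pattern, records);
    }
    return recordsByPattern;
  }
}
```

Each pattern yields its own PCollection, so records with different schemas stay separate until you decide how to combine them downstream.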