Read Parquet file using Apache Beam Java SDK without providing schema

It seems that the org.apache.beam.sdk.io.parquet.ParquetIO.readFiles method requires a schema to be passed in.

  • Is there a way to avoid the need to pass in the schema?
  • Isn't the schema included in the Parquet file?
  • What if I am trying to read multiple Parquet files with different schemas?

Please find my responses inline:

  • Is there a way to avoid the need to pass in the schema? Currently there is no mechanism to avoid passing the schema of the Parquet files.

  • Isn't the schema included in the Parquet file? Yes, that is correct: the file's own metadata (stored in the Parquet footer) contains the schema definition. Please refer to BEAM-8344, which is an open feature request to support schema inference.

  • What if I am trying to read multiple Parquet files with different schemas? You can do something like the following, passing a file pattern or path together with the matching schema, and repeat it for each set of files that has a different schema.

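  // Match the input files, then turn each match into a readable file handle.
  PCollection<FileIO.ReadableFile> files = pipeline
    .apply(FileIO.match().filepattern(options.getInputFilepattern()))
    .apply(FileIO.readMatches());

  // Read every matched file as Avro GenericRecords using the supplied Avro schema.
  PCollection<GenericRecord> output = files.apply(ParquetIO.readFiles(SCHEMA));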
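
On the second point (the schema travels with the file), here is a minimal sketch of where that metadata sits, assuming an unencrypted file in the standard Parquet layout: the file ends with a 4-byte little-endian length of the Thrift-encoded FileMetaData, followed by the 4-byte magic "PAR1". The class name ParquetFooterPeek and the method footerLength are hypothetical helpers for illustration only; they are not part of Beam or parquet-mr, and the sketch only locates the metadata block rather than decoding the schema inside it.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;

/** Hypothetical helper (not a Beam or parquet-mr API): locates the footer metadata of a Parquet file. */
final class ParquetFooterPeek {
    private static final byte[] MAGIC = "PAR1".getBytes(StandardCharsets.US_ASCII);

    private ParquetFooterPeek() {}

    /**
     * Returns the length in bytes of the Thrift-encoded FileMetaData block,
     * which is where the schema is stored. The block itself is not decoded here.
     */
    static int footerLength(Path file) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            long size = raf.length();
            // Smallest possible file: leading magic (4) + footer length (4) + trailing magic (4).
            if (size < 12) {
                throw new IOException("Too small to be a Parquet file: " + file);
            }
            // The last 8 bytes are: footer length (little-endian int), then the magic.
            byte[] tail = new byte[8];
            raf.seek(size - 8);
            raf.readFully(tail);
            for (int i = 0; i < MAGIC.length; i++) {
                if (tail[4 + i] != MAGIC[i]) {
                    throw new IOException("Missing PAR1 trailer: " + file);
                }
            }
            int length = ByteBuffer.wrap(tail, 0, 4).order(ByteOrder.LITTLE_ENDIAN).getInt();
            // The metadata must fit between the leading magic and the 8-byte trailer.
            if (length < 0 || length > size - 12) {
                throw new IOException("Implausible footer length " + length + ": " + file);
            }
            return length;
        }
    }
}
```

Calling ParquetFooterPeek.footerLength(path) on a valid file tells you how many bytes before the 8-byte trailer the metadata starts. Actually extracting the schema from those bytes needs a Thrift decoder such as the one in parquet-mr, which is why this sketch stops at locating the block.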
