Reading Parquet files with the Apache Beam Java SDK without providing a schema

Question

It seems that the org.apache.beam.sdk.io.parquet.ParquetIO.readFiles method requires a schema to be passed in.

Is there a way to avoid the need to pass in the schema?
Isn't the schema included in the Parquet file?
What if I am trying to read multiple Parquet files with different schemas?

Answer 1

Please find my responses inline.

Is there a way to avoid the need to pass in the schema? Currently there is no mechanism to avoid passing the schema of the Parquet files.
Isn't the schema included in the Parquet file? Yes, that is correct: the file's metadata holds the schema definition (Parquet keeps this metadata in the file footer rather than in a header). Please refer to BEAM-8344, which is an open feature request to support schema inference.
What if I am trying to read multiple Parquet files with different schemas? You can do something like the following: pass a file pattern or path together with the schema of the files it matches, and repeat the read with a different pattern and schema for each group of files.

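  // Match the input files, then open each match as a ReadableFile.
  PCollection<FileIO.ReadableFile> files = pipeline
    .apply(FileIO.match().filepattern(options.getInputFilepattern()))
    .apply(FileIO.readMatches());

  // SCHEMA is the Avro Schema shared by every file matched above.
  PCollection<GenericRecord> output = files.apply(ParquetIO.readFiles(SCHEMA));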
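
To illustrate the point that the schema travels inside each Parquet file, here is a minimal sketch in plain Java (no Beam or Parquet libraries) that locates the metadata block at the end of a file. It relies on the Parquet layout in which a file ends with a 4-byte little-endian metadata length followed by the magic bytes PAR1. The class name ParquetFooterProbe and the method footerLength are hypothetical names chosen for this example, and the sketch assumes an unencrypted footer. It only reports how large the metadata block is; decoding the schema itself would require the Parquet libraries.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of Beam or Parquet.
public class ParquetFooterProbe {
    private static final byte[] MAGIC = "PAR1".getBytes(StandardCharsets.US_ASCII);

    /** Returns the byte length of the metadata block stored at the end of a Parquet file. */
    public static int footerLength(String path) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            long size = file.length();
            // Leading magic (4) + metadata length (4) + trailing magic (4).
            if (size < 12) {
                throw new IOException("Too small to be a Parquet file: " + path);
            }
            // The last 8 bytes are: 4-byte little-endian length, then "PAR1".
            byte[] tail = new byte[8];
            file.seek(size - 8);
            file.readFully(tail);
            for (int i = 0; i < MAGIC.length; i++) {
                if (tail[4 + i] != MAGIC[i]) {
                    throw new IOException("Missing PAR1 trailer: " + path);
                }
            }
            return ByteBuffer.wrap(tail, 0, 4).order(ByteOrder.LITTLE_ENDIAN).getInt();
        }
    }
}
```

Calling ParquetFooterProbe.footerLength("some-file.parquet") on a valid file returns the size of the block that holds the schema, which is the information a schema-inferring reader would have to parse before reading any records.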

Answer 1, posted 2019-11-26 15:04:53
