

How does schema inference work in spark.read.parquet?

I'm trying to read a Parquet file with Spark and I have a question.

How is the type inferred when loading a Parquet file with spark.read.parquet?

  • 1. Parquet type INT32 -> Spark type IntegerType
  • 2. Inferred from the actual stored values -> Spark IntegerType

Is there a mapping dictionary, as in 1? Or is the type inferred from the actual stored values, as in 2?

Spark uses the Parquet schema to parse the file into an internal representation (i.e., a StructType); this information is a bit hard to find in the Spark docs. I went through the code and found the mapping you are looking for here:

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala#L197-L281
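
As a quick way to see this in practice, here is a minimal Scala sketch (the local master setting and the output path /tmp/parquet-schema-demo are illustrative assumptions, not anything from the original post). It writes a small DataFrame to Parquet and reads it back; the printed schema comes from the type information stored in the Parquet file footer, which the converter maps to Spark SQL types, rather than from scanning the row values.

```scala
import org.apache.spark.sql.SparkSession

object ParquetSchemaDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("parquet-schema-demo")
      .master("local[*]")          // assumption: run locally for the demo
      .getOrCreate()
    import spark.implicits._

    // Write a small DataFrame; Spark records the schema in the Parquet footer.
    val path = "/tmp/parquet-schema-demo"   // hypothetical output path
    Seq(1, 2, 3).toDF("id").write.mode("overwrite").parquet(path)

    // Reading back does not guess types from the data: the Parquet
    // physical/logical types in the footer (e.g. INT32) are mapped to
    // Spark types (IntegerType) by the schema converter.
    val df = spark.read.parquet(path)
    df.printSchema()
    // root
    //  |-- id: integer (nullable = true)

    spark.stop()
  }
}
```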
