

How does schema inference work in spark.read.parquet?

I'm trying to read a Parquet file with Spark and I have a question.

How is the type inferred when loading a Parquet file with spark.read.parquet?

  • 1. Parquet type INT32 -> Spark type IntegerType
  • 2. Inferred from the actual stored values -> Spark IntegerType

Is there a mapping dictionary, as in 1? Or is the type inferred from the actual stored values, as in 2?

Spark uses the Parquet schema to parse the file into an internal representation (i.e., a StructType); this information is a bit hard to find in the Spark docs. I went through the code and found the mapping you are looking for here:

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala#L197-L281
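
As a quick way to see this in practice, here is a minimal Scala sketch (the local master setting and the output path /tmp/parquet-schema-demo are illustrative assumptions, not anything from the original post). It writes a small DataFrame to Parquet and reads it back; the printed schema comes from the type information stored in the Parquet file footer, which the converter maps to Spark SQL types, rather than from scanning the row values.

```scala
import org.apache.spark.sql.SparkSession

object ParquetSchemaDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("parquet-schema-demo")
      .master("local[*]")          // assumption: run locally for the demo
      .getOrCreate()
    import spark.implicits._

    // Write a small DataFrame; Spark records the schema in the Parquet footer.
    val path = "/tmp/parquet-schema-demo"   // hypothetical output path
    Seq(1, 2, 3).toDF("id").write.mode("overwrite").parquet(path)

    // Reading back does not guess types from the data: the Parquet
    // physical/logical types in the footer (e.g. INT32) are mapped to
    // Spark types (IntegerType) by the schema converter.
    val df = spark.read.parquet(path)
    df.printSchema()
    // root
    //  |-- id: integer (nullable = true)

    spark.stop()
  }
}
```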
