Spark incorrectly interprets partition names ending in "d" or "f" as numbers when reading partitioned parquet

Question

I'm using spark.read.parquet() to read from a folder where parquet files are organized in partitions. The result is wrong when a partition name ends with f or d: apparently, Spark interprets such names as numbers instead of strings. I have created the minimal test case below to reproduce the problem.

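from pyspark.sql import SparkSession

# Reuse the active session, or create one when run as a standalone script.
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [
        ('9q', 1),
        ('3k', 2),
        ('6f', 3),
        ('7f', 4),
        ('7d', 5),
    ],
    schema='foo string, id integer',
)
# Write one directory per distinct value of foo (foo=9q, foo=6f, ...).
df.write.partitionBy('foo').parquet('./tmp_parquet', mode='overwrite')
# Read the data back; the foo column is rebuilt from the directory names.
read_back_df = spark.read.parquet('./tmp_parquet')
read_back_df.show()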

The read_back_df will be:

+---+---+                                                                       
| id|foo|
+---+---+
|  1| 9q|
|  4|7.0|
|  3|6.0|
|  2| 3k|
|  5|7.0|
+---+---+

Notice that partitions 6f/7f/7d become 6.0/7.0/7.0.

The Spark version is 2.4.3.

Answer 1

The behaviour that you see is expected.

From the Spark documentation:

Notice that the data types of the partitioning columns are automatically inferred.

You can disable this feature by setting spark.sql.sources.partitionColumnTypeInference.enabled to false.

The following code preserves the strings when reading the parquet files:

spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", False)
read_back_df = spark.read.parquet('./tmp_parquet')
read_back_df.show()

This prints:

+---+---+                                                                       
| id|foo|
+---+---+
|  3| 6f|
|  1| 9q|
|  4| 7f|
|  2| 3k|
|  5| 7d|
+---+---+
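
To see why only the names ending in f or d are affected, here is a minimal pure-Python sketch of a "try numbers first, fall back to string" inference. It assumes (consistent with the 6.0/7.0 output above, but not taken from the Spark source) that the numeric step behaves like Java's Double.parseDouble, which accepts a trailing f or d type suffix. The helper name infer_partition_value and the simplified regex are hypothetical and only illustrate the idea; this is not Spark's actual implementation.

```python
import re

# Simplified pattern for the decimal literals that Java's Double.parseDouble
# accepts, including the optional float/double type suffix (f, F, d, D).
# Assumption: hex floats, NaN and Infinity are left out to keep the sketch short.
JAVA_DOUBLE = re.compile(r'[+-]?([0-9]+\.?[0-9]*|\.[0-9]+)([eE][+-]?[0-9]+)?[fFdD]?')


def infer_partition_value(raw):
    """Hypothetical helper: mimic a numeric-first inference of a partition value."""
    # Plain integers are tried first.
    if re.fullmatch(r'[+-]?[0-9]+', raw):
        return int(raw)
    # Then anything a Java-style double parser would accept, suffix included.
    if JAVA_DOUBLE.fullmatch(raw):
        return float(raw.rstrip('fFdD'))
    # Everything else stays a string.
    return raw


for name in ['9q', '3k', '6f', '7f', '7d']:
    print(name, '->', repr(infer_partition_value(name)))
```

Under this assumption, 9q and 3k fail every numeric attempt and stay strings, while 6f, 7f and 7d parse as doubles, which matches the table in the question. Disabling the inference, as shown above, skips this step entirely.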

Solution 1
3 Accepted 2020-07-01 09:37:51
