Reading parquet partitioned table from S3 using pyspark is dropping leading zeros from partition column
I have written a pyspark dataframe as parquet to S3 using EMR (pyspark). The data is partitioned by column A, which is StringType().
In S3 the data looks something like this:
table_path:
  A=0003/
    part-file.parquet
  A=C456/
    part-file.parquet
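For context, here is a minimal sketch of the kind of write that produces this layout. The bucket path and sample rows below are made up; the real job runs on EMR:

from pyspark.sql import SparkSession
from pyspark.sql.types import StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

# Column A is explicitly StringType, so "0003" keeps its zeros on write
schema = StructType([
    StructField("A", StringType(), True),
    StructField("B", StringType(), True),
])
df = spark.createDataFrame([("0003", "x"), ("C456", "y")], schema)

# partitionBy("A") creates one A=<value>/ directory per distinct value
table_path = "s3://my-bucket/table_path"  # placeholder path
df.write.mode("overwrite").partitionBy("A").parquet(table_path)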
When I read this back as a dataframe using pyspark, I am losing the leading zeros in column 'A'. Here is what the data looks like:
df = spark.read.parquet(table_path)
df.show()
| A    | B  |
| 3    | .. |
| C456 | .. |
I don't want to lose the leading zeros here. The expected result is:
| A    | B  |
| 0003 | .. |
| C456 | .. |
I found the solution to this issue in the Delta documentation.
Spark has a property that is enabled by default: spark.sql.sources.partitionColumnTypeInference.enabled. With it on, Spark tries to infer the data type of each partition column from its values, so a string value like "0003" is parsed as the integer 3 and the leading zeros are lost. For a partition column that should stay a string, we can simply switch it off.
# Disable partition column type inference so string partition values
# like "0003" are read back as strings instead of the integer 3
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession

conf = SparkConf().set("spark.sql.sources.partitionColumnTypeInference.enabled", "false")
spark = SparkSession.builder.config(conf=conf).getOrCreate()
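With the session built this way, reading the table back should keep column A as a string, for example:

# Read the partitioned table back; A is no longer inferred as an integer
df = spark.read.parquet(table_path)
df.printSchema()              # A: string
df.select("A", "B").show()    # leading zeros preserved: 0003, C456

Spark appends partition columns at the end of the schema when reading, hence the select to restore the original column order.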