
Partition id getting cast implicitly while reading from s3 in spark/scala

I have source data in s3, and my spark/scala application reads this data and writes it out as parquet files after partitioning it on a new column, partition_id. The value of partition_id is derived by taking the first two characters of another id column that holds an alphanumeric string value. For example:

id = 2dedfdg34h, partition_id = 2d

After writing the data into s3, a separate partition folder is created for each partition and everything looks good. For example:

PRE partition_id=2d/
PRE partition_id=01/
PRE partition_id=0e/
PRE partition_id=fg/
PRE partition_id=5f/
PRE partition_id=jk/
PRE partition_id=06/
PRE partition_id=07/

But when I read these s3 files back into a dataframe, values like 1d, 2d, etc. are converted to 1.0, 2.0.

Spark version: 2.4.0

Please suggest how to avoid this implicit conversion.

The commands used to write and read partitioned data to/from s3:

dataframe.write.partitionBy("partition_id").option("compression", "gzip").parquet(<path>)
spark.read.parquet(<path>)

The issue here is that Spark erroneously infers that the partition column's type is numeric. This happens because some of the values actually are numbers (Spark will not look through all of them).
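As a side note that is not part of the original answer, the sketch below illustrates one reason values such as 2d can look numeric on the JVM: `java.lang.Double.parseDouble` accepts a trailing `d` or `f` type suffix. The object name `PartitionValueParsing` and the helper `asDouble` are hypothetical, and whether Spark's partition inference goes through exactly this call is an assumption here.

```scala
import scala.util.Try

object PartitionValueParsing {
  // Hypothetical helper for illustration: returns the double the JVM parser
  // produces for a raw partition value, or None if the value is not parseable.
  def asDouble(raw: String): Option[Double] =
    Try(java.lang.Double.parseDouble(raw)).toOption

  def main(args: Array[String]): Unit = {
    // Sample values taken from the partition folders listed in the question.
    val samples = Seq("2d", "5f", "fg", "jk")
    samples.foreach { s =>
      println(s"$s -> ${asDouble(s)}")
    }
    // "2d" and "5f" parse as 2.0 and 5.0 because of the type suffix;
    // "fg" and "jk" do not parse and yield None.
  }
}
```

Values like fg or jk are therefore unaffected, while 1d and 2d are at risk of being read as 1.0 and 2.0, which matches the symptom described in the question.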

To avoid this, simply turn off automatic type inference for partition columns when reading the data. This gives you the original string values, as wanted. It can be done as follows:

spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", "false")
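For context, here is a minimal sketch of where that setting would sit in a complete read job. The object name `ReadPartitionedAsString`, the application name, and taking the input location from the first command-line argument are assumptions for illustration; the argument stands in for `<path>` from the question. It assumes the job is launched with spark-submit so that a master is already configured.

```scala
import org.apache.spark.sql.SparkSession

object ReadPartitionedAsString {
  def main(args: Array[String]): Unit = {
    // Hypothetical: the input location is passed as the first argument.
    val path = args(0)

    val spark = SparkSession.builder()
      .appName("read-partitioned-as-string")
      .getOrCreate()

    // Disable partition column type inference before reading, so that
    // partition values are kept as strings.
    spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", "false")

    val df = spark.read.parquet(path)

    // Inspect the result: partition_id should be reported as a string column.
    df.printSchema()
    df.select("partition_id").distinct().show()

    spark.stop()
  }
}
```

The setting has to be applied before `spark.read.parquet` is called, since the partition values are interpreted when the dataframe is created.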
