
Spark incorrectly interprets data type from CSV as Double when string ends with 'd'

There is a CSV with a column ID (format: 8 digits with a "D" at the end). When the CSV is read with `.option("inferSchema", "true")`, Spark infers the data type as double and trims the trailing "D".

ACADEMIC_YEAR_SEM  ID
2013/1             12345678D
2013/1             22345678D
2013/2             32345678D

Image: https://i.stack.imgur.com/18Nu6.png

Is there any way (apart from setting inferSchema=False) to get the correct result? Thanks for the help!
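The behavior described above matches Java's floating-point parsing rules: Java's `Double.parseDouble` accepts a trailing 'd'/'D' (or 'f'/'F') type suffix, which would explain why a value like "12345678D" is inferred as a numeric column. A minimal sketch of that parsing rule in plain Python (the `java_style_parse_double` helper is hypothetical, for illustration only, not Spark's actual code):

```python
def java_style_parse_double(s: str) -> float:
    """Mimic Java's Double.parseDouble, which tolerates a trailing
    'd'/'D' or 'f'/'F' type suffix (hypothetical helper, not Spark code)."""
    s = s.strip()
    # Java strips the optional FloatTypeSuffix before converting the rest
    if s and s[-1] in "dDfF":
        s = s[:-1]
    return float(s)

# "12345678D" parses as a number under Java's rules, so type inference
# treats the whole column as double and the suffix is lost
print(java_style_parse_double("12345678D"))  # → 12345678.0
```

Python's own `float("12345678D")` would raise a `ValueError`, which is why the behavior can be surprising when coming from PySpark.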

You can specify the schema with `.schema` and pass a string with the columns and their types separated by commas:

df2 = spark.read.format('csv').option("header", "true") \
    .schema("ACADEMIC_YEAR_SEM string, ID string") \
    .load("pyspark_sample_data.csv")
df2.show()

+-----------------+--------+
|ACADEMIC_YEAR_SEM|      ID|
+-----------------+--------+
|           2013/1|1234567D|
|           2013/1|2234567D|
|           2013/2|3234567D|
+-----------------+--------+


