Spark incorrectly infers data type from CSV as double when a string ends with 'D'
There is a CSV with an ID column (format: 8 digits followed by "D"). When reading the CSV with .option("inferSchema", "true"), Spark infers the data type as double and trims the "D".
ACADEMIC_YEAR_SEM | ID |
---|---|
2013/1 | 12345678D |
2013/1 | 22345678D |
2013/2 | 32345678D |
Image: https://i.stack.imgur.com/18Nu6.png
Is there any way (apart from inferSchema=False) to get the correct result? Thanks for the help!
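For context on why this happens: Spark's CSV type inference relies on JVM number parsing, and Java's Double.parseDouble accepts an optional trailing type suffix ('d', 'D', 'f', 'F'), so "12345678D" parses as the double 12345678.0. A minimal pure-Python sketch mimicking that suffix rule (this is an illustration of the JVM parsing behavior, not Spark's actual inference code):

```python
import re

# Simplified pattern for a Java double literal: digits, optional fraction
# and exponent, plus an optional trailing type suffix d/D/f/F.
JAVA_DOUBLE = re.compile(r'^([+-]?\d+(?:\.\d*)?(?:[eE][+-]?\d+)?)[dDfF]?$')

def java_like_parse_double(s: str) -> float:
    """Parse a string roughly the way Java's Double.parseDouble would."""
    m = JAVA_DOUBLE.match(s.strip())
    if not m:
        raise ValueError(f"not a double: {s!r}")
    # The suffix is not part of the numeric value, so it is simply dropped.
    return float(m.group(1))

print(java_like_parse_double("12345678D"))  # 12345678.0 -- the 'D' is dropped
```

Note that "2013/1" does not match this pattern, which is why the ACADEMIC_YEAR_SEM column is still inferred as a string.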
You can specify the schema with .schema and pass a string with the columns and their types, separated by commas:
df2 = spark.read.format('csv').option("header", "true").schema("ACADEMIC_YEAR_SEM string, ID string")\
.load("pyspark_sample_data.csv")
+-----------------+--------+
|ACADEMIC_YEAR_SEM| ID|
+-----------------+--------+
| 2013/1|1234567D|
| 2013/1|2234567D|
| 2013/2|3234567D|
+-----------------+--------+
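If the CSV has many columns that should all stay strings, the DDL schema string can be built programmatically instead of typed by hand. A minimal sketch (the helper name `all_string_schema` is hypothetical, not part of the Spark API):

```python
def all_string_schema(columns):
    """Build a Spark DDL schema string declaring every column as string."""
    return ", ".join(f"{c} string" for c in columns)

schema = all_string_schema(["ACADEMIC_YEAR_SEM", "ID"])
print(schema)  # ACADEMIC_YEAR_SEM string, ID string
# Then pass it to the reader, e.g.:
# df = spark.read.format('csv').option("header", "true").schema(schema).load(path)
```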