Spark CSV Reader quoted numerics
I am currently reading in CSV data using the following code:
Dataset<Row> dataset = getSparkSession().read()
.option("header", "true")
.option("quote", '"')
.option("sep", ',')
.schema(schema)
.csv(path)
.toDF();
This points to a CSV file that has rows that look like this:
"abc","city","123"
as well as another file that has rows that look like this:
"abc","city",123
The second one works fine because the schema I pass is
string, string, long
The first one results in java.lang.NumberFormatException: For input string: "123"
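As context for the exception: this is what Long.parseLong throws when a token cannot be read as a number, and a field that still carries its surrounding CSV quotes fails the same way. A minimal plain-Java illustration (the class name is just for this sketch):

```java
public class QuotedTokenDemo {
    public static void main(String[] args) {
        // An unquoted token parses fine:
        System.out.println(Long.parseLong("123"));

        // A token that still carries its CSV quotes does not:
        try {
            Long.parseLong("\"123\"");
        } catch (NumberFormatException e) {
            // e.g. For input string: ""123""
            System.out.println(e.getMessage());
        }
    }
}
```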
Is it possible for the CSV reader to properly read CSVs in both valid formats, assuming the right options are passed?
I am using Spark 2.1.1.
Use the inferSchema option, which automatically identifies the data type of each column:
var data = sparkSession.read
  .option("header", hasColumnHeader)
  .option("inferSchema", "true")
  .csv(inputPath);
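Conceptually, inferSchema samples the values of each column, tries the narrowest numeric type first, and falls back to string. A rough plain-Java sketch of that idea (the inferType helper is hypothetical, not Spark's actual implementation, which considers more types such as timestamps and booleans):

```java
import java.util.List;

public class InferSketch {
    // Hypothetical sketch of per-column type inference:
    // try long, then double, otherwise fall back to string.
    static String inferType(List<String> values) {
        boolean allLong = true, allDouble = true;
        for (String v : values) {
            try { Long.parseLong(v); } catch (NumberFormatException e) { allLong = false; }
            try { Double.parseDouble(v); } catch (NumberFormatException e) { allDouble = false; }
        }
        if (allLong) return "long";
        if (allDouble) return "double";
        return "string";
    }

    public static void main(String[] args) {
        System.out.println(inferType(List.of("1", "2", "3")));   // long
        System.out.println(inferType(List.of("1.5", "2")));      // double
        System.out.println(inferType(List.of("abc", "123")));    // string
    }
}
```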
Using your code actually crashes for me. I suspect that passing characters instead of Strings is the culprit. Using '"'.toString for .option("quote", ...) fixes the crash and works. Furthermore, you may also want to define the escape character, as in the following code.
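A plausible reason the char crashes: DataFrameReader.option is overloaded (boolean, long, and double variants exist alongside the String one), and a char argument can widen to a numeric type, so '"' may be stored as the number 34 rather than as the quote character. A self-contained sketch of that overload-resolution pitfall (the option methods here are stand-ins, not Spark's):

```java
public class OptionOverloadDemo {
    // Stand-ins mirroring an overloaded option(...) API:
    static String option(String key, String value) { return "String: " + value; }
    static String option(String key, long value)   { return "long: " + value; }

    public static void main(String[] args) {
        // A char widens to long, so the numeric overload wins and the
        // quote setting silently becomes the code point 34:
        System.out.println(option("quote", '"'));   // long: 34
        // A String argument hits the intended overload:
        System.out.println(option("quote", "\"")); // String: "
    }
}
```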
In Cloudera's Spark2, I was able to use the following to parse both quoted and unquoted numbers to DecimalType, with a pre-defined schema:
spark.read
.option("mode", "FAILFAST")
.option("escape", "\"")
.option("delimiter", DELIMITER)
.option("header", HASHEADER.toString)
.option("quote", "\"")
.option("nullValue", null)
.option("ignoreLeadingWhiteSpace", value = true)
.schema(SCHEMA)
.csv(PATH)
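Setting escape to the quote character lets doubled quotes inside a quoted field stand for a literal quote, the RFC 4180 convention. A rough plain-Java sketch of that unquoting rule (the unquote helper is hypothetical, not part of Spark):

```java
public class EscapeDemo {
    // Hypothetical helper: strip surrounding quotes and collapse
    // doubled quotes ("" -> "), as escape="\"" allows in a CSV field.
    static String unquote(String field) {
        if (field.length() >= 2 && field.startsWith("\"") && field.endsWith("\"")) {
            return field.substring(1, field.length() - 1).replace("\"\"", "\"");
        }
        return field;
    }

    public static void main(String[] args) {
        System.out.println(unquote("\"15.23\""));          // 15.23
        System.out.println(unquote("\"say \"\"hi\"\"\"")); // say "hi"
        System.out.println(unquote("123"));                // 123
    }
}
```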
Examples of parsed numbers (from unit tests):
1.0
11
"15.23"
""          //empty field
"0.0000000001"
1111111111111.
 000000000. //with leading space
This also works in my tests for IntegerType; it can be parsed regardless of quotes.