Spark - Scala - Convert CSV file to custom object
How can I convert CSV data to a custom object in Spark? Below is my code snippet:
val sparkSession = SparkSession
.builder()
.appName("Spark SQL basic example")
.master("local[2]")
.getOrCreate()
val citiData = sparkSession.read.option("header", "true").option("inferSchema", "true").csv(filePath) // removing header,and applying schema
//citiData.describe().show()
import sparkSession.implicits._
val s: Dataset[CityData] = citiData.as[CityData]
}
//Date,Open,High,Low,Close,Volume
case class CityData(processingDate: java.util.Date, Open: Double, High: Double, Low: Double, Volume: Double)
Sample dataset:
Date,Open,High,Low,Close,Volume
2006-01-03,490.0,493.8,481.1,492.9,1537660
2006-01-04,488.6,491.0,483.5,483.8,1871020
2006-01-05,484.4,487.8,484.0,486.2,1143160
2006-01-06,488.8,489.0,482.0,486.2,1370250
I changed the type of the processingDate parameter of the CityData case class to String, but that causes the exception "cannot resolve 'processingDate' given input columns: [Volume, Close, High, Date, Low, Open];".
What can I do? Please share your ideas.
In your case, if you do not set the header option to true, Spark will read the columns as String type (the header row is then treated as data, so no column can be inferred as numeric). With the header option set, you can see the inferred schema:
val df = sqlContext.read.option("header", true).option("inferSchema", true).csv("pathToFile")
df.printSchema
//Prints
root
|-- Date: timestamp (nullable = true)
|-- Open: double (nullable = true)
|-- High: double (nullable = true)
|-- Low: double (nullable = true)
|-- Close: double (nullable = true)
|-- Volume: integer (nullable = true)
If you try to convert the rows into CityData, you will get the following error:
java.lang.UnsupportedOperationException: No Encoder found for java.util.Date
This means you cannot convert TimestampType directly into java.util.Date. See the Spark SQL type mappings for the supported types.
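For TimestampType, the Scala type Spark's encoders expect is java.sql.Timestamp. If downstream code really needs a java.util.Date, it may help that java.sql.Timestamp is a subclass of java.util.Date, so the conversion can happen after the Dataset has been read. A minimal sketch, where the object and method names are hypothetical and not part of Spark:

```scala
import java.sql.Timestamp
import java.util.Date

object TimestampDates {
  // Hypothetical helper: copies the instant (milliseconds since the epoch)
  // into a plain java.util.Date, dropping the Timestamp subclass.
  def toUtilDate(ts: Timestamp): Date = new Date(ts.getTime)
}
```

Because of the subclass relationship, a Timestamp value can also be assigned directly to a variable of type java.util.Date without calling any helper.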
After changing the type of processingDate from java.util.Date to java.sql.Timestamp, you will still get an error that says cannot resolve 'processingDate'. You also need to rename the field processingDate to Date in CityData, so that it matches the column name in the CSV header. Then you can convert your data set into Dataset[CityData] by using df.as[CityData]. I hope it helps!
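Putting the two fixes together (a supported timestamp type and field names that match the CSV header), the conversion could look roughly like the sketch below. It assumes Spark 2.x or later on the classpath. The object name, the load method, and the fallback file name citi.csv are illustrative assumptions, not part of the original question.

```scala
import java.sql.Timestamp
import org.apache.spark.sql.{Dataset, SparkSession}

// Field names match the CSV header: Date,Open,High,Low,Close,Volume.
// Volume is declared as Long so that an inferred integer column can be upcast to it.
case class CityData(Date: Timestamp, Open: Double, High: Double,
                    Low: Double, Close: Double, Volume: Long)

object CsvToCityData {
  // Hypothetical helper: reads the CSV with header and schema inference,
  // then maps the columns onto CityData by name.
  def load(spark: SparkSession, path: String): Dataset[CityData] = {
    import spark.implicits._
    spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv(path)
      .as[CityData]
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CSV to custom object")
      .master("local[2]")
      .getOrCreate()
    // Assumed default path; pass the real file path as the first argument.
    val filePath = args.headOption.getOrElse("citi.csv")
    load(spark, filePath).show(5)
    spark.stop()
  }
}
```

If the case class should keep the field name processingDate, an alternative is to rename the column before the conversion, for example with df.withColumnRenamed("Date", "processingDate"), and leave the field name unchanged.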