
Spark - Scala - Convert CSV file to custom object

How do I convert CSV data to a custom object in Spark? Below is my code snippet:

val sparkSession = SparkSession
  .builder()
  .appName("Spark SQL basic example")
  .master("local[2]")
  .getOrCreate()

// read the CSV, treating the first row as a header and inferring the schema
val citiData = sparkSession.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(filePath)

//citiData.describe().show()
import sparkSession.implicits._
val s: Dataset[CityData] = citiData.as[CityData]

// Date,Open,High,Low,Close,Volume
case class CityData(processingDate: java.util.Date, Open: Double, High: Double, Low: Double, Volume: Double)

Sample dataset:

Date,Open,High,Low,Close,Volume
2006-01-03,490.0,493.8,481.1,492.9,1537660
2006-01-04,488.6,491.0,483.5,483.8,1871020
2006-01-05,484.4,487.8,484.0,486.2,1143160
2006-01-06,488.8,489.0,482.0,486.2,1370250

I also tried changing the CityData input parameter type to String, but that causes a "cannot resolve 'processingDate' given input columns: [Volume, Close, High, Date, Low, Open];" exception.

  1. How can I create the custom object?
  2. Another tricky part: how do I convert the date column to a Date object?

How can I do this? Please share your ideas.

In your case, if you do not set the inferSchema option to true, Spark reads every column as a String. With header and inferSchema enabled, you can see:

val df = sparkSession.read.option("header", true).option("inferSchema", true).csv("pathToFile")
df.printSchema()
//Prints
root
|-- Date: timestamp (nullable = true)
|-- Open: double (nullable = true)
|-- High: double (nullable = true)
|-- Low: double (nullable = true)
|-- Close: double (nullable = true)
|-- Volume: integer (nullable = true)

If you try to convert the rows into CityData, you will get the following error:

java.lang.UnsupportedOperationException: No Encoder found for java.util.Date

This means you cannot map TimestampType directly to java.util.Date. Here are the type mappings (a short sketch of the DateType path follows the list):

  • TimestampType => java.sql.Timestamp
  • DateType => java.sql.Date
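To illustrate the second mapping (my own variant, not part of the original answer): if you want java.sql.Date in the case class instead of a Timestamp, you can cast the inferred timestamp column to DateType before the conversion:

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DateType

// Hypothetical case class using java.sql.Date; field names match the CSV columns.
case class CityDataWithDate(Date: java.sql.Date, Open: Double, High: Double, Low: Double, Close: Double, Volume: Int)

// assumes df from above and sparkSession.implicits._ in scope
val dsWithDate = df
  .withColumn("Date", col("Date").cast(DateType))
  .as[CityDataWithDate]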

After changing the type of processingDate from java.util.Date to java.sql.Timestamp, you will still get an error saying cannot resolve 'processingDate'. You also need to rename the field processingDate to Date in CityData so that it matches the column name. Then you can convert your data into a Dataset[CityData] by using df.as[CityData].
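Putting it together, here is a minimal end-to-end sketch of the fix (my own assembly of the steps above, assuming Spark 2.x and the filePath variable from the question):

import java.sql.Timestamp
import org.apache.spark.sql.{Dataset, SparkSession}

// Define the case class at top level (outside the method) so Spark can derive an encoder.
// Field names must match the CSV header; the inferred timestamp column maps to java.sql.Timestamp.
case class CityData(Date: Timestamp, Open: Double, High: Double, Low: Double, Close: Double, Volume: Int)

val sparkSession = SparkSession
  .builder()
  .appName("Spark SQL basic example")
  .master("local[2]")
  .getOrCreate()
import sparkSession.implicits._

val citiData = sparkSession.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(filePath)

val s: Dataset[CityData] = citiData.as[CityData]
s.show()

If you prefer to keep the name processingDate, you can instead rename the column before the conversion, e.g. citiData.withColumnRenamed("Date", "processingDate").as[CityData], with the case class field declared as processingDate: java.sql.Timestamp. I hope it helps!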
