
Spark - Scala - Convert CSV file to custom object

How do I convert CSV data to a custom object in Spark? Below is my code snippet:

val sparkSession = SparkSession
  .builder()
  .appName("Spark SQL basic example")
  .master("local[2]")
  .getOrCreate()

// read the CSV, treating the first row as a header and inferring the schema
val citiData = sparkSession.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(filePath)

//citiData.describe().show()
import sparkSession.implicits._
val s: Dataset[CityData] = citiData.as[CityData]

// Date,Open,High,Low,Close,Volume
case class CityData(processingDate: java.util.Date, Open: Double, High: Double, Low: Double, Volume: Double)

Sample dataset:

Date,Open,High,Low,Close,Volume
2006-01-03,490.0,493.8,481.1,492.9,1537660
2006-01-04,488.6,491.0,483.5,483.8,1871020
2006-01-05,484.4,487.8,484.0,486.2,1143160
2006-01-06,488.8,489.0,482.0,486.2,1370250

I also tried changing the CityData input parameter type to String, but that causes a "cannot resolve 'processingDate' given input columns: [Volume, Close, High, Date, Low, Open];" exception.

  1. How can I create the custom object?
  2. Another tricky part: how do I convert the date column to a Date object?

How can I do this? Please share your ideas.

In your case, if you do not set the inferSchema option to true, Spark reads every column as a String. With header and inferSchema enabled, you can see:

val df = sparkSession.read.option("header", true).option("inferSchema", true).csv("pathToFile")
df.printSchema()
//Prints
root
|-- Date: timestamp (nullable = true)
|-- Open: double (nullable = true)
|-- High: double (nullable = true)
|-- Low: double (nullable = true)
|-- Close: double (nullable = true)
|-- Volume: integer (nullable = true)

If you try to convert the rows into CityData, you will get the following error:

java.lang.UnsupportedOperationException: No Encoder found for java.util.Date

This means you cannot map TimestampType directly to java.util.Date. Here are the type mappings (a short sketch of the DateType path follows the list):

  • TimestampType => java.sql.Timestamp
  • DateType => java.sql.Date
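To illustrate the second mapping (my own variant, not part of the original answer): if you want java.sql.Date in the case class instead of a Timestamp, you can cast the inferred timestamp column to DateType before the conversion:

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DateType

// Hypothetical case class using java.sql.Date; field names match the CSV columns.
case class CityDataWithDate(Date: java.sql.Date, Open: Double, High: Double, Low: Double, Close: Double, Volume: Int)

// assumes df from above and sparkSession.implicits._ in scope
val dsWithDate = df
  .withColumn("Date", col("Date").cast(DateType))
  .as[CityDataWithDate]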

After changing the type of processingDate from java.util.Date to java.sql.Timestamp, you will still get an error saying cannot resolve 'processingDate'. You also need to rename the field processingDate to Date in CityData so that it matches the column name. Then you can convert your data into a Dataset[CityData] by using df.as[CityData].
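Putting it together, here is a minimal end-to-end sketch of the fix (my own assembly of the steps above, assuming Spark 2.x and the filePath variable from the question):

import java.sql.Timestamp
import org.apache.spark.sql.{Dataset, SparkSession}

// Define the case class at top level (outside the method) so Spark can derive an encoder.
// Field names must match the CSV header; the inferred timestamp column maps to java.sql.Timestamp.
case class CityData(Date: Timestamp, Open: Double, High: Double, Low: Double, Close: Double, Volume: Int)

val sparkSession = SparkSession
  .builder()
  .appName("Spark SQL basic example")
  .master("local[2]")
  .getOrCreate()
import sparkSession.implicits._

val citiData = sparkSession.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(filePath)

val s: Dataset[CityData] = citiData.as[CityData]
s.show()

If you prefer to keep the name processingDate, you can instead rename the column before the conversion, e.g. citiData.withColumnRenamed("Date", "processingDate").as[CityData], with the case class field declared as processingDate: java.sql.Timestamp. I hope it helps!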
