
spark Scala RDD to DataFrame Date format

Would you be able to help with this Spark problem statement?

Data:

empno|ename|designation|manager|hire_date|sal|deptno    
7369|SMITH|CLERK|9902|2010-12-17|800.00|20
7499|ALLEN|SALESMAN|9698|2011-02-20|1600.00|30

Code:

val rawrdd = spark.sparkContext.textFile("C:\\Users\\cmohamma\\data\\delta scenarios\\emp_20191010.txt")

val refinedRDD = rawrdd.map( lines => {
  val fields = lines.split("\\|")
  (fields(0).toInt, fields(1), fields(2), fields(3).toInt, fields(4).toDate, fields(5).toFloat, fields(6).toInt)
})

Problem statement - this is not working: fields(4).toDate. What is the alternative, or what is the correct usage?

What have I tried?

  1. Tried replacing it with to_date(col(fields(4)), "yyy-MM-dd") - not working (to_date returns a Column expression, so it only works inside a DataFrame query, not in an RDD map).

2.

Step 1.

val refinedRDD = rawrdd.map( lines => {
  val fields = lines.split("\\|")
  (fields(0), fields(1), fields(2), fields(3), fields(4), fields(5), fields(6))
})

Now these tuple elements are all strings.

Step 2.

import org.apache.spark.sql.types._

val mySchema = StructType(Array(
  StructField("empno", IntegerType, true),
  StructField("ename", StringType, true),
  StructField("designation", StringType, true),
  StructField("manager", IntegerType, true),
  StructField("hire_date", DateType, true),
  StructField("sal", DoubleType, true),
  StructField("deptno", IntegerType, true)
))

Step 3. Converting the string tuples to Rows:

val rowRDD = refinedRDD.map(attributes => Row(attributes._1, attributes._2, attributes._3, attributes._4, attributes._5, attributes._6, attributes._7))

Step 4.

val empDF = spark.createDataFrame(rowRDD, mySchema)

This is also not working and gives an error related to types. To solve this I changed Step 1 to:

(fields(0).toInt,fields(1),fields(2),fields(3).toInt,fields(4),fields(5).toFloat,fields(6).toInt)

Now this gives an error for the date-type column, and I am back at the main problem.

Use case: use the textFile API and convert the result to a DataFrame using a custom schema (StructType) on top of it.

This can be done using a case class, but with a case class I would also be stuck at the point where I need to do fields(4).toDate (I know I can cast the string to a date later in the code, but I want to know whether a solution to the above problem is possible).
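Since String has no .toDate method, the direct RDD-level fix is to parse the field explicitly. Here is a minimal sketch, assuming hire_date is always a strict yyyy-MM-dd string (java.sql.Date.valueOf throws on anything else), and using .toDouble so sal matches the DoubleType declared in mySchema:

import java.sql.Date

val header = rawrdd.first()   // the "empno|ename|..." header line, if the file contains one
val refinedRDD = rawrdd.filter(_ != header).map { line =>
  val fields = line.split("\\|")
  // java.sql.Date.valueOf parses strict "yyyy-MM-dd" into java.sql.Date,
  // which maps to Spark's DateType
  (fields(0).toInt, fields(1), fields(2), fields(3).toInt, Date.valueOf(fields(4)), fields(5).toDouble, fields(6).toInt)
}

With this change, Steps 2-4 above work unchanged, since java.sql.Date lines up with DateType and Double with DoubleType.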

You can use the following code snippet:

import org.apache.spark.sql.functions.to_timestamp

scala> val df = spark.read.format("csv").option("header", "true").option("delimiter", "|").load("gs://otif-etl-input/test.csv")
df: org.apache.spark.sql.DataFrame = [empno: string, ename: string ... 5 more fields]

scala> val ts = to_timestamp($"hire_date", "yyyy-MM-dd")
ts: org.apache.spark.sql.Column = to_timestamp(`hire_date`, 'yyyy-MM-dd')

scala> val enriched_df = df.withColumn("ts", ts)
enriched_df: org.apache.spark.sql.DataFrame = [empno: string, ename: string ... 6 more fields]

scala> enriched_df.show(2, false)
+-----+-----+-----------+-------+----------+-------+----------+-------------------+
|empno|ename|designation|manager|hire_date |sal    |deptno    |ts                 |
+-----+-----+-----------+-------+----------+-------+----------+-------------------+
|7369 |SMITH|CLERK      |9902   |2010-12-17|800.00 |20        |2010-12-17 00:00:00|
|7499 |ALLEN|SALESMAN   |9698   |2011-02-20|1600.00|30        |2011-02-20 00:00:00|
+-----+-----+-----------+-------+----------+-------+----------+-------------------+

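If a DateType column is preferred over TimestampType (the question's schema uses DateType), to_date works the same way; a small sketch, with dated_df as an illustrative name:

import org.apache.spark.sql.functions.to_date

// to_date yields a DateType column (no time-of-day component)
val dated_df = df.withColumn("hire_date", to_date($"hire_date", "yyyy-MM-dd"))
dated_df.printSchema  // hire_date is now of type date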

There are multiple ways to cast your data to proper data types.

First: use inferSchema

val df = spark.read
  .option("delimiter", "|")
  .option("header", true)
  .option("inferSchema", "true")
  .csv(path)
df.printSchema

Sometimes it does not work as expected; for example, the date column may still be inferred as a plain string or as a timestamp.
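If inference does leave hire_date as a string, an explicit cast afterwards is a simple fallback; a short sketch (fixed is just an illustrative name):

import org.apache.spark.sql.functions.col

// cast hire_date from string to DateType after loading
val fixed = df.withColumn("hire_date", col("hire_date").cast("date"))
fixed.printSchema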

Second: provide your own datatype conversion template

val rawDF = Seq(
  ("7369", "SMITH", "2010-12-17", "800.00"),
  ("7499", "ALLEN", "2011-02-20", "1600.00")
).toDF("empno", "ename", "hire_date", "sal")

// define the target schema as data; hire_date is declared as date
val schemaDF = Seq(
  ("empno", "INT"),
  ("ename", "STRING"),
  ("hire_date", "date"),
  ("sal", "double")
).toDF("columnName", "columnType")
rawDF.printSchema

(screenshot: rawDF.printSchema output, all four columns are string)

// fetch schema details
val dataTypes = schemaDF.select("columnName", "columnType")
val listOfElements = dataTypes.collect.map(_.toSeq.toList)

// create a map-friendly template that casts a column to its target type
import org.apache.spark.sql.functions.col
val validationTemplate = (c: Any, t: Any) => {
  val column = c.asInstanceOf[String]
  val typ = t.asInstanceOf[String]
  col(column).cast(typ)
}

// apply the datatype conversion template on rawDF
val convertedDF = rawDF.select(listOfElements.map(element => validationTemplate(element(0), element(1))): _*)
println("Conversion done!")
convertedDF.show()
convertedDF.printSchema

(screenshot: convertedDF output; schema now shows empno as int, hire_date as date, sal as double)
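A nice property of this second approach is that the column-to-type mapping is plain data, so the same casting code can be driven by a config file or a metadata table instead of being hard-coded.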

Third: case class

Create a schema from the case class with ScalaReflection and provide this customized schema while loading the DataFrame.

import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.types._
import java.sql.Date

case class MySchema(empno: Int, ename: String, hire_date: Date, sal: Double)

val schema = ScalaReflection.schemaFor[MySchema].dataType.asInstanceOf[StructType]

val rawDF = spark.read.schema(schema).option("header", "true").option("delimiter", "|").csv(path)
rawDF.printSchema
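The same case class idea can also be applied straight to the textFile RDD from the question via Spark's encoders, which derives the schema with no hand-built StructType; a spark-shell-style sketch, assuming rawrdd from the question and a hypothetical Emp case class covering all seven columns:

import java.sql.Date
import spark.implicits._

case class Emp(empno: Int, ename: String, designation: String, manager: Int,
               hire_date: Date, sal: Double, deptno: Int)

val header = rawrdd.first()   // skip the header line, which textFile reads as data
val empDF = rawrdd.filter(_ != header).map { line =>
  val f = line.split("\\|")
  Emp(f(0).toInt, f(1), f(2), f(3).toInt, Date.valueOf(f(4)), f(5).toDouble, f(6).toInt)
}.toDF()
empDF.printSchema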

Hope this will help.


 