
Not able to apply function to Spark Dataframe Column

I am trying to apply a function to one of my dataframe columns to convert the values. The values in the column look like "20160907", and I need them to be "2016-09-07".

I wrote a function like this:

def convertToDate(inDate: String): String = {
   val year = inDate.substring(0, 4)
   val month = inDate.substring(4, 6)
   val day = inDate.substring(6, 8)

   s"$year-$month-$day"
}
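
A quick check in the spark-shell confirms the function itself works:

convertToDate("20160907")  // returns "2016-09-07"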

And in my Spark Scala code, I am using this:

def final_Val {
  val oneDF = hiveContext.read.orc("/tmp/new_file.txt")
  val convertToDate_udf = udf(convertToDate _)
  val convertedDf = oneDF.withColumn("modifiedDate", convertToDate_udf(col("EXP_DATE")))
  convertedDf.show()
}

Surprisingly, in the spark-shell I can run this without any error. In the Scala IDE, however, I get the compilation error below:

Multiple markers at this line:
not enough arguments for method udf: (implicit evidence$2: reflect.runtime.universe.TypeTag[String], implicit evidence$3: reflect.runtime.universe.TypeTag[String])org.apache.spark.sql.UserDefinedFunction. Unspecified value parameters evidence$2, evidence$3.

I am using Spark 1.6.2 and Scala 2.10.5.

Can someone please tell me what I am doing wrong here?

I tried the same code with different functions, like in this post: stackoverflow.com/questions/35227568/applying-function-to-spark-dataframe-column, and I get no compilation issues with that code. I am unable to find the issue with mine.

From what I have learned in a Spark Summit course, you should use the sql.functions methods as much as possible. Before implementing your own UDF, check whether an existing function in the sql.functions package already does the same work. Using the built-in functions, Spark can do a lot of optimizations for you, and it is not obliged to serialize and deserialize your data from and to JVM objects.

To achieve the result you want, I propose this solution:

import org.apache.spark.sql.functions.{from_unixtime, unix_timestamp}
import spark.implicits._ // for $ and .toDF; on Spark 1.6 use sqlContext.implicits._ and sqlContext instead of spark

// Parse the "yyyyMMdd" strings into epoch seconds, then format them back as "yyyy-MM-dd".
val oneDF = spark.sparkContext.parallelize(Seq("19931001", "19931001")).toDF("EXP_DATE")
val convertedDF = oneDF.withColumn("modifiedDate", from_unixtime(unix_timestamp($"EXP_DATE", "yyyyMMdd"), "yyyy-MM-dd"))
convertedDF.show()

This gives the following result:

+--------+------------+
|EXP_DATE|modifiedDate|
+--------+------------+
|19931001|  1993-10-01|
|19931001|  1993-10-01|
+--------+------------+
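
If you do want to keep the UDF approach from your question, a minimal sketch (untested on Spark 1.6.2, so treat it as an assumption) is to pass the type parameters to udf explicitly; this supplies the two TypeTag evidence values that the compiler error complains about:

import org.apache.spark.sql.functions.{col, udf}

// udf[ReturnType, ArgType] pins the TypeTags explicitly instead of
// relying on the compiler to resolve them as implicit evidence.
val convertToDate_udf = udf[String, String](convertToDate _)
val convertedDf = oneDF.withColumn("modifiedDate", convertToDate_udf(col("EXP_DATE")))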

Hope this helps. Best regards.
