
Creating a UDF function with a non-primitive data type and using it in a Spark SQL query: Scala

I am creating a function in Scala which I want to use in my Spark SQL query. The query works fine in Hive, and also if I run the same query directly in Spark SQL, but I use the same query in multiple places, so I want to turn it into a reusable function/method that I can call whenever it is needed. I have created the function below in my Scala class.

def date_part(date_column:Column) = {
    val m1: Column = month(to_date(from_unixtime(unix_timestamp(date_column, "dd-MM-yyyy")))) //give  value as 01,02...etc

    m1 match {
        case 01 => concat(concat(year(to_date(from_unixtime(unix_timestamp(date_column, "dd-MM- yyyy"))))-1,'-'),substr(year(to_date(from_unixtime(unix_timestamp(date_column, "dd-MM-yyyy")))),3,4))
        //etc..
        case _ => "some other logic"
    }
}

but it shows multiple errors:

  1. For 01:

     - Decimal integer literals may not have a leading zero. (Octal syntax is obsolete.)
     - type mismatch; found: Int(0) required: org.apache.spark.sql.Column.

  2. For '-':

     - type mismatch; found: Char('-') required: org.apache.spark.sql.Column.

  3. For 'substr':

     - not found: value substr.

Also, if I create even a simple function whose parameter type is Column, I am not able to register it: I get an error that this is not possible in columnar format. For all primitive data types (String, Long, Int) it works fine, but in my case the type is Column, so I cannot do this. Can someone please guide me on how I should do this? So far I have found on Stack Overflow that I need to use this function on a DataFrame and then convert that DataFrame to a temp table. Can someone please suggest an alternative way, so that I can use this functionality without many changes to my existing code?

Firstly, Spark will need to read the file where the data is stored. I guess this file is a CSV, but you can use the json method instead of csv.

Then you can add new columns with a calculated value as follows:

     import org.apache.spark.sql.DataFrame
     import org.apache.spark.sql.functions._

      val df = spark.read
        .option("header", "true")
        .option("inferSchema", "true")
        .csv("/path/mydata.csv")

      def transformDate(dateColumn: String, df: DataFrame): DataFrame = {
        // parse the "dd-MM-yyyy" string column once and reuse it
        val parsedDate = to_date(from_unixtime(unix_timestamp(col(dateColumn), "dd-MM-yyyy")))

        df.withColumn("calculatedCol", month(parsedDate)) // month number: 1, 2, ...
          .withColumn("newColumnWithDate",
            when(col("calculatedCol") === 1,
              concat((year(parsedDate) - 1).cast("string"), lit("-"),
                substring(year(parsedDate).cast("string"), 3, 2)))
            .when(col("calculatedCol") === 2, lit("some other logic"))
            .otherwise(lit("nothing match")))
      }

     // call the function for the DataFrame whose date column you want to transform
     // (it returns a new DataFrame, so keep the result):
     val transformedDf = transformDate("date_column", df)

Note that some functions need a Column as argument, not a literal value, so use lit() to wrap such values.
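
For example, a minimal sketch of wrapping a literal with lit(), assuming the df and date_column used above (the label column name is just an illustration):

     import org.apache.spark.sql.functions.{col, concat, lit}

     // concat expects Column arguments, so the literal suffix must be wrapped with lit()
     val labelled = df.withColumn("label", concat(col("date_column"), lit(" (raw)")))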

A UDF is not needed (and in terms of performance it is not recommended), but you can use one in the following way:

import org.apache.spark.sql.functions.{col, udf}

// plain Scala function holding the transformation logic
val upper: String => String = _.toUpperCase
// wrap it as a UDF so it can be applied to a Column
val upperUDF = udf(upper)
df.withColumn("upper", upperUDF(col("text"))).show

Here the upper function is where you would put the logic to transform the date column.
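
For instance, a minimal sketch of such a UDF (datePartUDF is a hypothetical name, and the "dd-MM-yyyy" input format and January logic are taken from the question):

import java.time.LocalDate
import java.time.format.DateTimeFormatter
import org.apache.spark.sql.functions.{col, udf}

// hypothetical UDF: for January dates returns "<previous year>-<last two digits of year>",
// otherwise a placeholder, mirroring the question's date_part logic
val datePartUDF = udf { (date: String) =>
  val d = LocalDate.parse(date, DateTimeFormatter.ofPattern("dd-MM-yyyy"))
  if (d.getMonthValue == 1) s"${d.getYear - 1}-${d.getYear.toString.substring(2)}"
  else "some other logic"
}

df.withColumn("datePart", datePartUDF(col("date_column"))).show()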

Try the code below.

scala> import org.joda.time.format._
import org.joda.time.format._

scala> spark.udf.register("datePart",(date:String) => DateTimeFormat.forPattern("MM-dd-yyyy").parseDateTime(date).toString(DateTimeFormat.forPattern("MMyyyy")))
res102: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,StringType,Some(List(StringType)))

scala> spark.sql("""select datePart("03-01-2019") as datepart""").show
+--------+
|datepart|
+--------+
|  032019|
+--------+
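
Because the UDF is registered with spark.udf.register, it can also be called from Spark SQL against a temp view, which matches the reusability requirement. A minimal sketch (my_table and date_col are hypothetical names, and the values must be in the "MM-dd-yyyy" format the UDF expects):

    // register the DataFrame as a temp view so the SQL query can reference it
    df.createOrReplaceTempView("my_table")
    spark.sql("select date_col, datePart(date_col) as datepart from my_table").show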
