Creating a UDF with non-primitive data types and using it in a Spark SQL query: Scala

Question

I am writing a function in Scala that I want to use in my Spark SQL query. The query works fine in Hive and also when I run it directly in Spark SQL, but I use the same query in multiple places, so I want to turn it into a reusable function/method that I can call whenever it is needed. I created the function below in my Scala class.

def date_part(date_column:Column) = {
    val m1: Column = month(to_date(from_unixtime(unix_timestamp(date_column, "dd-MM-yyyy")))) //give  value as 01,02...etc

    m1 match {
        case 01 => concat(concat(year(to_date(from_unixtime(unix_timestamp(date_column, "dd-MM- yyyy"))))-1,'-'),substr(year(to_date(from_unixtime(unix_timestamp(date_column, "dd-MM-yyyy")))),3,4))
        //etc..
        case _ => "some other logic"
    }
}

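But it shows multiple errors.

For 01:

◾ Decimal integer literals may not have a leading zero. (Octal syntax is obsolete.)

◾ type mismatch; found: Int(0), required: org.apache.spark.sql.Column.

For '-':

type mismatch; found: Char('-'), required: org.apache.spark.sql.Column.

For 'substr':

not found: value substr.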

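Also, if I create even a simple function whose type is Column, I cannot register it, because I get an error saying this is not possible for the Column type. For all primitive data types (String, Long, Int) it works fine, but in my case the type is Column, so I cannot do this. Can someone guide me on how to do this? So far, I have found on Stack Overflow that I need to use this function with a DataFrame and then convert that DataFrame into a temp table. Can someone suggest an alternative, so that I can use this function without many changes to my existing code?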

Answer 1

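First, Spark needs to read the file where the data is stored. I assume the file is a CSV, but you can use the json method instead of csv.

Then you can add new columns with computed values, like this: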

     import org.apache.spark.sql.DataFrame
     import org.apache.spark.sql.functions._

      val df = spark.read
        .option("header", "true")
        .option("inferSchema", "true")
        .csv("/path/mydata.csv")

      def transformDate(dateColumn: String, df: DataFrame): DataFrame = {
         // Parse the "dd-MM-yyyy" string once and reuse the resulting date expression.
         val parsed = to_date(from_unixtime(unix_timestamp(col(dateColumn), "dd-MM-yyyy")))

         // Chain both withColumn calls so the first new column is not discarded.
         df.withColumn("calculatedCol", month(parsed)) // month as an Int: 1, 2, ...
           .withColumn("newColumnWithDate",
             when(col("calculatedCol") === 1,
               // previous year, a dash, then the last two digits of the current year
               concat((year(parsed) - 1).cast("string"), lit("-"), substring(year(parsed).cast("string"), 3, 2)))
             .when(col("calculatedCol") === 2, lit("some other logic"))
             .otherwise(lit("nothing match")))
      }

     // Call the function on the DataFrame whose date column you want to transform:
     transformDate("date_column", df)

Note that some functions require a Column as an argument rather than a plain string value, so wrap such values with lit().
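For the reusability the question asks about, one option is a plain Scala function that takes a Column and returns a Column. It only builds an expression, so nothing has to be registered. The sketch below is an illustration under assumptions: the name `datePartCol`, the sample data, and the local SparkSession are all hypothetical, and `to_date` with a format argument needs Spark 2.2 or later.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions._

// Hypothetical helper: returns a Column expression, so no UDF registration is involved.
def datePartCol(dateColumn: Column): Column = {
  val parsed = to_date(dateColumn, "dd-MM-yyyy")
  // Last two digits of the year, taken from its string form.
  val shortYear = substring(year(parsed).cast("string"), 3, 2)
  when(month(parsed) === 1,
    // Literals must be wrapped in lit() before they can be passed to concat.
    concat((year(parsed) - 1).cast("string"), lit("-"), shortYear))
    .otherwise(lit("some other logic")) // placeholder kept from the question
}

// Assumed local session and sample data, only for trying the helper out.
val spark = SparkSession.builder().master("local[1]").appName("datePartSketch").getOrCreate()
import spark.implicits._

val sample = Seq("15-01-2019", "15-02-2019").toDF("date_column")
val result = sample.withColumn("part", datePartCol(col("date_column")))
```

A function like this cannot be called by name inside a SQL string. To query the result with SQL, apply it with withColumn first and then expose the DataFrame through createOrReplaceTempView.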

A UDF is not needed here (and is not recommended for performance reasons), but you can use one as follows:

import org.apache.spark.sql.functions.{col, udf}

// A plain Scala function wrapped as a UDF and applied to the string column "text".
val upper: String => String = _.toUpperCase
val upperUDF = udf(upper)
df.withColumn("upper", upperUDF(col("text"))).show()

Here the upper function stands in for the method that would contain your logic for transforming the date column.
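As a rough illustration of what such a method could look like for the question's date logic, here is a sketch that works on a plain String instead of a Column, which is what makes it usable as a UDF. The name `datePartValue` is hypothetical, only the January case from the question is filled in, and java.time is an assumption on my part rather than something the question uses.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

// Hypothetical helper: plain String in, plain String out, so it can be wrapped in a UDF.
def datePartValue(date: String): String = {
  val parsed = LocalDate.parse(date, DateTimeFormatter.ofPattern("dd-MM-yyyy"))
  parsed.getMonthValue match {
    // January: previous year, a dash, then the last two digits of the current year.
    case 1 => s"${parsed.getYear - 1}-${parsed.getYear.toString.takeRight(2)}"
    // Placeholder kept from the question for the remaining months.
    case _ => "some other logic"
  }
}
```

Note that the match is on an Int (1), not on 01 and not on a Column. It could then be wrapped with udf(datePartValue _), or registered for SQL with spark.udf.register("date_part", datePartValue _). A null input would throw inside the function, so real data would need a guard.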

Answer 2

Try the code below.

scala> import org.joda.time.format._
import org.joda.time.format._

scala> spark.udf.register("datePart",(date:String) => DateTimeFormat.forPattern("MM-dd-yyyy").parseDateTime(date).toString(DateTimeFormat.forPattern("MMyyyy")))
res102: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,StringType,Some(List(StringType)))

scala> spark.sql("""select datePart("03-01-2019") as datepart""").show
+--------+
|datepart|
+--------+
|  032019|
+--------+
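If Joda-Time is not on the classpath, the same conversion can probably be written with java.time from the standard library. This is a sketch, not part of the original answer. The names `monthYear`, `inputFormat`, and `outputFormat` are made up for illustration, and the patterns are copied from the session above.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

// Same patterns as in the Joda-Time version: "MM-dd-yyyy" in, "MMyyyy" out.
val inputFormat = DateTimeFormatter.ofPattern("MM-dd-yyyy")
val outputFormat = DateTimeFormatter.ofPattern("MMyyyy")

// Hypothetical replacement for the lambda passed to spark.udf.register.
def monthYear(date: String): String =
  LocalDate.parse(date, inputFormat).format(outputFormat)
```

It would be registered the same way, for example spark.udf.register("datePart", monthYear _).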

Answer 1
0 2020-05-07 06:03:55

Answer 2
0 Accepted 2020-05-07 06:16:49
