
Type mismatch in Spark UDF

I have created the following UDF to fetch only the integer part of decimal values.

def udf_cleansing(col1 : Double) = udf((col1 : Double) => {
val col2 : String = f"$col1%.5f"
if(col2.trim == "" || col2 == null ) 0.toString else col2.substring(0,col2.indexOf("."))}
)

However, when calling this function with a command like

df_aud.select(udf_cleansing(df_aud("HASH_TTL")))

I am getting the following error:

<console>:42: error: type mismatch;
 found   : org.apache.spark.sql.Column
 required: Double
       df_aud.select(udf_cleansing(df_aud("HASH_TTL")))

I tried the command

df_aud.withColumn("newc",udf_cleansing(df_aud("HASH_TTL").cast("double")))

I am still getting the same error.

The reason is that Scala treats df_aud("HASH_TTL") as a parameter to the udf_cleansing function itself, not to the UDF this function returns.

Instead, you should write:

def udf_cleansing = udf(
    (col1 : Double) => {
        val col2 : String = f"$col1%.5f"
        if(col2.trim == "" || col2 == null ) 0.toString else col2.substring(0,col2.indexOf("."))
    }
)
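The function wrapped by udf is plain Scala, so its behaviour can be checked without Spark. A minimal sketch (assuming the default JVM locale uses '.' as the decimal separator, which the f-interpolator relies on):

```scala
// The inner function on its own, runnable without Spark.
val cleansing: Double => String = (col1: Double) => {
  val col2: String = f"$col1%.5f"            // e.g. 123.456 -> "123.45600"
  if (col2.trim == "" || col2 == null) 0.toString
  else col2.substring(0, col2.indexOf("."))  // keep only the integer part
}

println(cleansing(123.456))  // prints "123" in a '.'-decimal locale
```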

Now udf_cleansing returns a UDF. The UDF takes one parameter of type Column, and that column's value is supplied to the wrapped inner function.

Then use it exactly the way you already tried to.
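With the fixed definition, both call styles from the question compile, since udf_cleansing now evaluates to a UDF that accepts a Column (a sketch, assuming df_aud is in scope):

```scala
df_aud.select(udf_cleansing(df_aud("HASH_TTL")))
df_aud.withColumn("newc", udf_cleansing(df_aud("HASH_TTL").cast("double")))
```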

I would recommend you use Spark's built-in functions as much as possible. Only if none of the built-in functions can satisfy your needs would I suggest going with udf functions, as a udf requires the data to be serialized and deserialized to perform the operation you have devised.

Your udf function can be replaced by the format_string and substring_index built-in functions as below:

import org.apache.spark.sql.functions._
df_aud.select(substring_index(format_string("%.5f", df_aud("HASH_TTL")), ".", 1))
