
How to use registered Hive UDF in Spark DataFrame?

I have registered my Hive UDF using the following code:

hiveContext.udf().register("MyUDF", new UDF1<String, String>() {
    public String call(String o) throws Exception {
        // bla bla
    }
}, DataTypes.StringType);

Now I want to use MyUDF above in a DataFrame. How do we use it? I know how to use it in SQL, and it works fine:

hiveContext.sql("SELECT MyUDF('test') FROM myTable");

My hiveContext.sql() query involves a group by on multiple columns, so for scaling purposes I am trying to convert this query to the DataFrame API:

dataframe.select("col1","col2","coln").groupBy("col1","col2","coln").count();

Can we do the following: dataframe.select(MyUDF("col1"))?

I tested the following with pyspark 3.x running on top of YARN, and it works:

from pyspark.sql.functions import expr

df1 = df.withColumn("result", expr("MyUDF('test')"))
df1.show()
df2 = df.selectExpr("MyUDF('test') as result")
df2.show()

In case you come across a "Class not found" error, you might want to add the jar using spark.sql("ADD JAR hdfs://...").
