How to use registered Hive UDF in Spark DataFrame?
I have registered my Hive UDF using the following code:
hiveContext.udf().register("MyUDF", new UDF1<String, String>() {
    @Override
    public String call(String o) throws Exception {
        // bla bla
    }
}, DataTypes.StringType);
Now I want to use MyUDF with the DataFrame above. How do we use it? I know how to use it in SQL, and it works fine:
hiveContext.sql("select MyUDF('test') from myTable");
My hiveContext.sql() query involves a group by on multiple columns, so for scaling purposes I am trying to convert this query to the DataFrame API:
dataframe.select("col1","col2","coln").groupBy("col1","col2","coln").count();
Can we do the following: dataframe.select(MyUDF("col1"))?
I tested the following with PySpark 3.x running on top of YARN, and it works:
from pyspark.sql.functions import expr
df1 = df.withColumn("result", expr("MyUDF('test')"))
df1.show()
df2 = df.selectExpr("MyUDF('test') as result")
df2.show()
In case you come across a "Class not found" error, you might want to add the jar using spark.sql("ADD JAR hdfs://...").