How to use registered Hive UDF in Spark DataFrame?
I have registered my Hive UDF using the following code:
hiveContext.udf().register("MyUDF", new UDF1<String, String>() {
    @Override
    public String call(String o) throws Exception {
        // bla bla
    }
}, DataTypes.StringType);
Now I want to use MyUDF with the DataFrame above. How do we use it? I know how to use it in SQL, and it works fine:
hiveContext.sql("select MyUDF('test') from myTable");
My hiveContext.sql() query involves a group by on multiple columns, so for scaling purposes I am trying to convert this query to the DataFrame API:
dataframe.select("col1","col2","coln").groupBy("col1","col2","coln").count();
Can we do the following: dataframe.select(MyUDF("col1"))?
I tested the following with PySpark 3.x running on top of YARN, and it works:
from pyspark.sql.functions import expr
df1 = df.withColumn("result", expr("MyUDF('test')"))
df1.show()
df2 = df.selectExpr("MyUDF('test') as result")
df2.show()
In case you come across a "Class not found" error, you might want to add the jar using spark.sql("ADD JAR hdfs://...").