Trying to execute a Spark SQL query from a UDF

I am trying to write an inline function in Spark using Scala that takes a String input, executes a SQL statement, and returns a String value:

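    val testFunc: String => String = (arg1: String) => {
      // Look up c_code with a SQL query issued from inside the function body
      val k = sqlContext.sql("""select c_code from r_c_tbl where x_nm = "something" """)
      // Return the first column of the first result row
      k.head().getString(0)
    }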

I am registering this Scala function as a UDF:

   val testFunc_test = udf(testFunc)

I have a DataFrame over a Hive table:

    val df = sqlContext.table("some_table")

Then I call the UDF in a withColumn and try to save the result in a new DataFrame:

    val new_df = df.withColumn("test", testFunc_test($"col1"))

But every time I try to do this, I get an error:

16/08/10 21:17:08 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1,       10.0.1.5): java.lang.NullPointerException
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:41)
    at org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2086)
    at org.apache.spark.sql.DataFrame.foreach(DataFrame.scala:1434)

I am relatively new to Spark and Scala, and I am not sure why this code fails to run. Any insights or a workaround would be highly appreciated.

Please note that I have not pasted the whole stack trace. Please let me know if it is required.

You can't use sqlContext in your UDF. UDFs must be serializable so that they can be shipped to executors, and the context (which can be thought of as a connection to the cluster) can't be serialized and sent to the nodes. Only the driver application (where the UDF is defined, but not executed) can use the sqlContext.
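Since only the driver can run queries, one possible workaround is to run the lookup once on the driver and attach the resulting plain String to every row. This fits the question's code only because its query filters on a fixed literal and never uses arg1. The following is a minimal sketch, not a definitive implementation: it assumes Spark 2.x or later (SparkSession, createOrReplaceTempView), and the object name, the lookupCode helper, and the in-memory stand-in data are all hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit}

// Hypothetical object name, used only for this sketch
object DriverSideLookup {

  // Hypothetical helper: runs the lookup on the driver and returns a plain String.
  // head() throws if no row matches, just like the code in the question.
  def lookupCode(spark: SparkSession, name: String): String =
    spark.table("r_c_tbl")
      .filter(col("x_nm") === name)
      .select("c_code")
      .head()
      .getString(0)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("driver-side-lookup")
      .getOrCreate()
    import spark.implicits._

    // Stand-in data (assumption): the real r_c_tbl and some_table are Hive tables
    Seq(("something", "C1")).toDF("x_nm", "c_code").createOrReplaceTempView("r_c_tbl")
    val df = Seq("a", "b").toDF("col1")

    // The query executes here, on the driver, before any task is shipped
    val code = lookupCode(spark, "something")

    // Only the String value is sent to executors, as a literal column
    val newDf = df.withColumn("test", lit(code))
    newDf.show()

    spark.stop()
  }
}
```

If the looked-up value has to vary per row, this approach no longer applies, because the query would again need to run once per record.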

It looks like your use case (performing a select from table X for each record in table Y) would be better accomplished with a join.
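A rough sketch of what such a join could look like is shown here. It assumes that x_nm in r_c_tbl is meant to be matched against col1 of some_table (the question's query uses a fixed literal, so this is a guess), and it assumes Spark 2.x or later for SparkSession. The object name, the addCode helper, and the in-memory stand-in data are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical object name, used only for this sketch
object JoinInsteadOfUdf {

  // Hypothetical helper: adds a "test" column holding the c_code whose x_nm equals col1.
  // Assumes df has no columns named x_nm or c_code of its own.
  def addCode(df: DataFrame, lookup: DataFrame): DataFrame =
    df.join(lookup, df("col1") === lookup("x_nm"), "left_outer")
      .drop("x_nm")                        // drop the join key from the lookup side
      .withColumnRenamed("c_code", "test") // name the result like the UDF column

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("join-instead-of-udf")
      .getOrCreate()
    import spark.implicits._

    // Stand-in data (assumption): with Hive these would be
    // spark.table("some_table") and spark.table("r_c_tbl")
    val df = Seq("a", "b", "c").toDF("col1")
    val lookup = Seq(("a", "A1"), ("b", "B1")).toDF("x_nm", "c_code")

    val newDf = addCode(df, lookup)
    newDf.show()

    spark.stop()
  }
}
```

Two differences from the UDF version are worth keeping in mind. A left outer join leaves test as null for rows with no match, whereas head() would throw. If r_c_tbl contains several rows for the same x_nm, the join produces one output row per match rather than taking only the first.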
