
Calling Java/Scala function from a task

Background

My original question here was Why using DecisionTreeModel.predict inside map function raises an exception? and it is related to How to generate tuples of (original label, predicted label) on Spark with MLlib?

When we use the Scala API, the recommended way of getting predictions for an RDD[LabeledPoint] using DecisionTreeModel is to simply map over the RDD:

val labelAndPreds = testData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}

Unfortunately, a similar approach in PySpark doesn't work so well:

labelsAndPredictions = testData.map(
    lambda lp: (lp.label, model.predict(lp.features)))
labelsAndPredictions.first()

Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.

Instead of that, the official documentation recommends something like this:

predictions = model.predict(testData.map(lambda x: x.features))
labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
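As an aside, the zipped RDD can then be consumed just like its Scala counterpart, for example to compute a simple test error. A minimal sketch reusing labelsAndPredictions as defined above:

# Fraction of points whose predicted label differs from the original label
testErr = (labelsAndPredictions
    .filter(lambda lp: lp[0] != lp[1])
    .count() / float(testData.count()))
print("Test Error = " + str(testErr))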

So what is going on here? There is no broadcast variable here, and the Scala API defines predict as follows:

/**
 * Predict values for a single data point using the model trained.
 *
 * @param features array representing a single data point
 * @return Double prediction from the trained model
 */
def predict(features: Vector): Double = {
  topNode.predict(features)
}

/**
 * Predict values for the given data set using the model trained.
 *
 * @param features RDD representing data points to be predicted
 * @return RDD of predictions for each of the given data points
 */
def predict(features: RDD[Vector]): RDD[Double] = {
  features.map(x => predict(x))
}

So, at least at first glance, calling it from an action or transformation is not a problem, since prediction seems to be a local operation.

Explanation

After some digging I figured out that the source of the problem is the JavaModelWrapper.call method invoked from DecisionTreeModel.predict. It accesses the SparkContext, which is required to call the Java function:

callJavaFunc(self._sc, getattr(self._java_model, name), *a)
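For context, the Python side looks roughly like this (paraphrased and simplified from pyspark.mllib.common and pyspark.mllib.tree in Spark 1.x; exact details vary between versions). Both the single-point and the RDD variants of predict go through self.call and therefore need self._sc, which only exists on the driver:

from pyspark import RDD, SparkContext
from pyspark.mllib.common import callJavaFunc
from pyspark.mllib.linalg import _convert_to_vector

class JavaModelWrapper(object):
    # Wrapper around a model object living on the JVM side
    def __init__(self, java_model):
        self._sc = SparkContext._active_spark_context  # driver-only handle
        self._java_model = java_model

    def call(self, name, *a):
        # Needs a live SparkContext, hence it fails inside worker-side closures
        return callJavaFunc(self._sc, getattr(self._java_model, name), *a)

class DecisionTreeModel(JavaModelWrapper):
    def predict(self, x):
        if isinstance(x, RDD):
            return self.call("predict", x.map(_convert_to_vector))
        else:
            return self.call("predict", _convert_to_vector(x))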

Question

In the case of DecisionTreeModel.predict there is a recommended workaround, and all the required code is already part of the Scala API, but is there any elegant way to handle problems like this in general?

The only solutions I can think of right now are rather heavyweight:

  • pushing everything down to the JVM, either by extending Spark classes through implicit conversions or by adding some kind of wrappers
  • using the Py4J gateway directly (a driver-side sketch is shown below)
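To make the second point concrete: on the driver, PySpark already exposes its Py4J gateway through internal attributes such as sc._jvm, so arbitrary JVM classes are reachable there; the difficulty is doing the same from code that runs on the workers. A minimal, driver-only sketch (internal, unsupported API; assumes an interactive shell where sc exists):

# Driver side only: reach into the JVM through PySpark's own Py4J gateway
jvm = sc._jvm                      # internal attribute, not a stable API
names = jvm.java.util.ArrayList()  # any class on the JVM classpath is reachable
names.add("visible from the driver")
print(names)                       # the same calls would fail inside rdd.map(...)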

Answer

Communication using the default Py4J gateway is simply not possible. To understand why, we have to take a look at the following diagram from the PySpark Internals document [1]:

[Diagram from the PySpark Internals document: the Py4J gateway runs alongside the driver, while worker-side Python interpreters communicate with the JVM over sockets.]

Since the Py4J gateway runs on the driver, it is not accessible to the Python interpreters, which communicate with JVM workers through sockets (see for example PythonRDD / rdd.py).

Theoretically it could be possible to create a separate Py4J gateway for each worker, but in practice it is unlikely to be useful. Ignoring issues like reliability, Py4J is simply not designed to perform data-intensive tasks.

Are there any workarounds?

  1. Using the Spark SQL Data Sources API to wrap JVM code (a consumption-side sketch is shown after this list).

    Pros: Supported, high level, doesn't require access to the internal PySpark API

    Cons: Relatively verbose and not very well documented, limited mostly to input data

  2. Operating on DataFrames using Scala UDFs (the Python side is sketched after this list).

    Pros: Easy to implement (see Spark: How to map Python with Scala or Java User Defined Functions?), no data conversion between Python and Scala if the data is already stored in a DataFrame, minimal access to Py4J

    Cons: Requires access to the Py4J gateway and internal methods, limited to Spark SQL, hard to debug, not supported

  3. Creating a high-level Scala interface in a similar way to how it is done in MLlib.

    Pros: Flexible, ability to execute arbitrarily complex code. It can be done either directly on RDDs (see for example MLlib model wrappers) or with DataFrames (see How to use a Scala class inside Pyspark). The latter solution seems to be much more friendly, since all ser-de details are already handled by the existing API.

    Cons: Low level, requires data conversion, and like UDFs it requires access to Py4J and the internal API, not supported

    Some basic examples can be found in Transforming PySpark RDD with Scala; a minimal DataFrame-based sketch is also shown after this list.

  4. Using an external workflow management tool to switch between Python and Scala/Java jobs and passing data through a DFS.

    Pros: Easy to implement, minimal changes to the code itself

    Cons: Cost of reading/writing data (Alluxio?)

  5. Using a shared SQLContext (see for example Apache Zeppelin or Livy) to pass data between guest languages using registered temporary tables.

    Pros: Well suited for interactive analysis

    Cons: Not so well suited for batch jobs (Zeppelin), or may require additional orchestration (Livy)
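To illustrate option 1: once the JVM logic is packaged as a Spark SQL data source (the provider package name com.example.predictions below is purely hypothetical, as is the modelPath option), the Python side consumes it like any other external source such as spark-csv:

# Hypothetical custom data source wrapping the JVM code; only the pattern matters here
df = (sqlContext.read
    .format("com.example.predictions")      # hypothetical provider package
    .option("modelPath", "/path/to/model")  # hypothetical option understood by the source
    .load())
df.show()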
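To illustrate option 2, assume a Scala UDF has been registered on the underlying JVM SQLContext, for example (Scala side) sqlContext.udf.register("multiplyByTwo", (x: Double) => 2 * x); the name multiplyByTwo is purely illustrative. Because the Python sqlContext wraps that same JVM context, the registered function is then visible from SQL issued in Python:

# The Scala UDF is assumed to be registered on the shared JVM SQLContext (see above)
df = sqlContext.createDataFrame([(1.0,), (2.0,)], ["x"])
df.registerTempTable("points")
sqlContext.sql("SELECT x, multiplyByTwo(x) AS doubled FROM points").show()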
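To illustrate the DataFrame variant of option 3: the Python wrapper typically just unwraps the underlying Java DataFrame, calls the Scala method through the internal gateway, and wraps the result back. A minimal sketch, assuming a hypothetical Scala object com.example.SimpleScalaTransformer with a method addOne(df: DataFrame): DataFrame has been packaged into a jar on the classpath (internal, unsupported API; Spark 1.x naming):

from pyspark.sql import DataFrame

df = sqlContext.createDataFrame([(1,), (2,)], ["x"])
transformer = sc._jvm.com.example.SimpleScalaTransformer  # hypothetical Scala object
result_jdf = transformer.addOne(df._jdf)    # pass the underlying Java DataFrame
result = DataFrame(result_jdf, sqlContext)  # wrap the JVM result back for Python
result.show()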


  1. Joshua Rosen. (2014, August 04). PySpark Internals. Retrieved from https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals
