
How do you create a spark dataframe on a worker node when using HyperOpt and SparkTrials?

I'm trying to run ML trials in parallel using HyperOpt with SparkTrials on Databricks.

My objective function converts the outputs to a Spark DataFrame using spark.createDataFrame(results) (to reuse some preprocessing code I've previously created - I'd prefer not to have to rewrite this).

However, this causes an error when attempting to use HyperOpt and SparkTrials, because the SparkContext used to create the DataFrame "should only be created or accessed on the driver". Is there any way I can create a Spark DataFrame in my objective function here?

For a reproducible example:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

from hyperopt import fmin, tpe, hp, SparkTrials, STATUS_OK, Trials

from pyspark.sql import SparkSession

# If you are running Databricks Runtime for Machine Learning, `mlflow` is already installed and you can skip the following line. 
import mlflow

# Load the iris dataset from scikit-learn
iris = load_iris()
X = iris.data
y = iris.target

def objective(C):
    # Create a support vector classifier model
    clf = SVC(C=C)
    # THESE TWO LINES CAUSE THE PROBLEM
    ss = SparkSession.builder.getOrCreate()
    sdf = ss.createDataFrame([('Alice', 1)])

    # Use the cross-validation accuracy to compare the models' performance
    accuracy = cross_val_score(clf, X, y).mean()
    
    # Hyperopt tries to minimize the objective function. A higher accuracy value means a better model, so you must return the negative accuracy.
    return {'loss': -accuracy, 'status': STATUS_OK}

search_space = hp.lognormal('C', 0, 1.0)
algo=tpe.suggest

# THIS WORKS (It's not using SparkTrials)
argmin = fmin(
  fn=objective,
  space=search_space,
  algo=algo,
  max_evals=16)

from hyperopt import SparkTrials

spark_trials = SparkTrials()

# THIS FAILS
argmin = fmin(
  fn=objective,
  space=search_space,
  algo=algo,
  max_evals=16,
  trials=spark_trials)

I have tried looking at "How can I get the current SparkSession in any place of the codes?", but it is solving a different problem - I can't see an obvious way to apply it to my situation.

I'm curious if anyone looked further into this? I'm running into the same issue with my use case, where I'm using Spark's ML Pipelines to transform a raw DataFrame and train the model.

I also tried querying the data first and saving it outside of the objective function, but that complicates things, as we can't simply serialize a DataFrame (similar to the issue in this thread).
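
For what it's worth, one way that workaround can be made viable is to run the Spark query once on the driver, call .toPandas(), and let the objective close over the resulting plain pandas/numpy objects (which pickle fine, unlike a Spark DataFrame). A rough sketch, assuming the data fits in driver memory; the table name and "label" column are hypothetical:

from pyspark.sql import SparkSession
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from hyperopt import fmin, tpe, hp, SparkTrials, STATUS_OK

spark = SparkSession.builder.getOrCreate()

# Hypothetical source: any Spark query/transformation runs here, on the driver only.
pdf = spark.table("training_data").toPandas()  # materialise as pandas; must fit in memory

X = pdf.drop(columns=["label"]).values  # "label" is a hypothetical target column
y = pdf["label"].values

def objective(C):
    # X and y are plain numpy arrays captured in the closure, so SparkTrials can
    # pickle them and evaluate this on worker nodes without touching the SparkContext.
    clf = SVC(C=C)
    accuracy = cross_val_score(clf, X, y).mean()
    return {"loss": -accuracy, "status": STATUS_OK}

argmin = fmin(fn=objective, space=hp.lognormal("C", 0, 1.0),
              algo=tpe.suggest, max_evals=16, trials=SparkTrials())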

I think the short answer is that it's not possible. The SparkContext can only exist on the driver node, and creating a new instance inside a trial would be a kind of nesting; see this related question:

Nesting parallelizations in Spark? What's the right approach?

In the end I solved my problem by rewriting the transformations in pandas, which then worked.
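
For the question's reproducible example, a minimal sketch of that rewrite: build the intermediate results with pandas instead of spark.createDataFrame, so nothing in the objective needs a SparkSession. The results content is just the placeholder from the question, and any real preprocessing would be expressed as pandas operations:

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from hyperopt import STATUS_OK

# Same data as in the question.
X, y = load_iris(return_X_y=True)

def objective(C):
    clf = SVC(C=C)

    # Build the intermediate results with pandas instead of spark.createDataFrame,
    # so no SparkSession/SparkContext is needed on the worker.
    results = [("Alice", 1)]
    pdf = pd.DataFrame(results, columns=["name", "value"])
    # ...preprocessing previously written against the Spark DataFrame goes here,
    # rewritten as pandas operations on pdf...

    accuracy = cross_val_score(clf, X, y).mean()
    return {"loss": -accuracy, "status": STATUS_OK}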

If the transformations are too big for a single node, then you'd probably have to pre-compute them and let hyperopt choose which version to use as part of the optimisation, as in the sketch below.
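
A sketch of that idea using hp.choice to pick between pre-computed feature sets. Here the versions are stand-ins built with scikit-learn scalers so the example runs on its own; in practice each entry would be whatever you pre-computed with Spark on the driver and materialised in a form small enough to ship to the workers:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.svm import SVC
from hyperopt import fmin, tpe, hp, SparkTrials, STATUS_OK

iris = load_iris()

# Pre-computed versions of the features; in the real case each entry would be the
# output of a Spark transformation, collected to the driver ahead of time.
precomputed = {
    "minmax": (MinMaxScaler().fit_transform(iris.data), iris.target),
    "standard": (StandardScaler().fit_transform(iris.data), iris.target),
}

search_space = {
    "C": hp.lognormal("C", 0, 1.0),
    "features": hp.choice("features", sorted(precomputed.keys())),
}

def objective(params):
    # The chosen key selects one of the pre-computed datasets.
    X_version, y_version = precomputed[params["features"]]
    clf = SVC(C=params["C"])
    accuracy = cross_val_score(clf, X_version, y_version).mean()
    return {"loss": -accuracy, "status": STATUS_OK}

argmin = fmin(fn=objective, space=search_space, algo=tpe.suggest,
              max_evals=16, trials=SparkTrials())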
