
How do you create a spark dataframe on a worker node when using HyperOpt and SparkTrials?

I'm trying to run ML trials in parallel using HyperOpt with SparkTrials on Databricks.

My objective function converts the outputs to a Spark DataFrame using spark.createDataFrame(results) (to reuse some preprocessing code I've previously created - I'd prefer not to have to rewrite this).

However, this causes an error when attempting to use HyperOpt and SparkTrials, because the SparkContext used to create the DataFrame "should only be created or accessed on the driver". Is there any way I can create a Spark DataFrame in my objective function here?

For a reproducible example:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

from hyperopt import fmin, tpe, hp, SparkTrials, STATUS_OK, Trials

from pyspark.sql import SparkSession

# If you are running Databricks Runtime for Machine Learning, `mlflow` is already installed and you can skip the following line. 
import mlflow

# Load the iris dataset from scikit-learn
iris = load_iris()
X = iris.data
y = iris.target

def objective(C):
    # Create a support vector classifier model
    clf = SVC(C=C)
    # THESE TWO LINES CAUSE THE PROBLEM
    ss = SparkSession.builder.getOrCreate()
    sdf = ss.createDataFrame([('Alice', 1)])

    # Use the cross-validation accuracy to compare the models' performance
    accuracy = cross_val_score(clf, X, y).mean()
    
    # Hyperopt tries to minimize the objective function. A higher accuracy value means a better model, so you must return the negative accuracy.
    return {'loss': -accuracy, 'status': STATUS_OK}

search_space = hp.lognormal('C', 0, 1.0)
algo=tpe.suggest

# THIS WORKS (It's not using SparkTrials)
argmin = fmin(
  fn=objective,
  space=search_space,
  algo=algo,
  max_evals=16)

from hyperopt import SparkTrials

spark_trials = SparkTrials()

# THIS FAILS
argmin = fmin(
  fn=objective,
  space=search_space,
  algo=algo,
  max_evals=16,
  trials=spark_trials)

I have tried looking at "How can I get the current SparkSession in any place of the codes?", but it is solving a different problem - I can't see an obvious way to apply it to my situation.

I'm curious if anyone looked further into this? I'm running into the same issue with my use case, where I'm using Spark's ML Pipelines to transform a raw DataFrame and train the model.

I also tried querying the data first and saving it outside of the objective function, but that complicates things, as we can't simply serialize a DataFrame (similar to the issue in this thread).
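
For what it's worth, one way that workaround can be made viable is to run the Spark query once on the driver, call .toPandas(), and let the objective close over the resulting plain pandas/numpy objects (which pickle fine, unlike a Spark DataFrame). A rough sketch, assuming the data fits in driver memory; the table name and "label" column are hypothetical:

from pyspark.sql import SparkSession
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from hyperopt import fmin, tpe, hp, SparkTrials, STATUS_OK

spark = SparkSession.builder.getOrCreate()

# Hypothetical source: any Spark query/transformation runs here, on the driver only.
pdf = spark.table("training_data").toPandas()  # materialise as pandas; must fit in memory

X = pdf.drop(columns=["label"]).values  # "label" is a hypothetical target column
y = pdf["label"].values

def objective(C):
    # X and y are plain numpy arrays captured in the closure, so SparkTrials can
    # pickle them and evaluate this on worker nodes without touching the SparkContext.
    clf = SVC(C=C)
    accuracy = cross_val_score(clf, X, y).mean()
    return {"loss": -accuracy, "status": STATUS_OK}

argmin = fmin(fn=objective, space=hp.lognormal("C", 0, 1.0),
              algo=tpe.suggest, max_evals=16, trials=SparkTrials())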

I think the short answer is that it's not possible. The SparkContext can only exist on the driver node, and creating a new instance inside a trial would be a kind of nesting; see this related question:

Nesting parallelizations in Spark? What's the right approach?

In the end I solved my problem by rewriting the transformations in pandas, which then worked.
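
For the question's reproducible example, a minimal sketch of that rewrite: build the intermediate results with pandas instead of spark.createDataFrame, so nothing in the objective needs a SparkSession. The results content is just the placeholder from the question, and any real preprocessing would be expressed as pandas operations:

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from hyperopt import STATUS_OK

# Same data as in the question.
X, y = load_iris(return_X_y=True)

def objective(C):
    clf = SVC(C=C)

    # Build the intermediate results with pandas instead of spark.createDataFrame,
    # so no SparkSession/SparkContext is needed on the worker.
    results = [("Alice", 1)]
    pdf = pd.DataFrame(results, columns=["name", "value"])
    # ...preprocessing previously written against the Spark DataFrame goes here,
    # rewritten as pandas operations on pdf...

    accuracy = cross_val_score(clf, X, y).mean()
    return {"loss": -accuracy, "status": STATUS_OK}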

If the transformations are too big for a single node, then you'd probably have to pre-compute them and let hyperopt choose which version to use as part of the optimisation, as in the sketch below.
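
A sketch of that idea using hp.choice to pick between pre-computed feature sets. Here the versions are stand-ins built with scikit-learn scalers so the example runs on its own; in practice each entry would be whatever you pre-computed with Spark on the driver and materialised in a form small enough to ship to the workers:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.svm import SVC
from hyperopt import fmin, tpe, hp, SparkTrials, STATUS_OK

iris = load_iris()

# Pre-computed versions of the features; in the real case each entry would be the
# output of a Spark transformation, collected to the driver ahead of time.
precomputed = {
    "minmax": (MinMaxScaler().fit_transform(iris.data), iris.target),
    "standard": (StandardScaler().fit_transform(iris.data), iris.target),
}

search_space = {
    "C": hp.lognormal("C", 0, 1.0),
    "features": hp.choice("features", sorted(precomputed.keys())),
}

def objective(params):
    # The chosen key selects one of the pre-computed datasets.
    X_version, y_version = precomputed[params["features"]]
    clf = SVC(C=params["C"])
    accuracy = cross_val_score(clf, X_version, y_version).mean()
    return {"loss": -accuracy, "status": STATUS_OK}

argmin = fmin(fn=objective, space=search_space, algo=tpe.suggest,
              max_evals=16, trials=SparkTrials())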
