Using a Python-trained model on a PySpark DataFrame

Question

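To get started, I trained a logistic regression model on a dataset and use that model in the script below, but it gives me an error saying "No module named 'sklearn'". I have already installed the package there, but it still does not work. Can anyone tell me what I can do? Here is the script, which I found on this blog: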

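import pyspark.sql.functions as f
import pyspark.sql.types as t
from pyspark.ml.feature import VectorAssembler
from pyspark.sql.window import Window as w
from sklearn.linear_model import LogisticRegression

# fit a scikit-learn model on the driver (X, Y are the training features and labels)
model = LogisticRegression(C=1e5)
model.fit(X, Y)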

#creating test data from Pyspark
vectorAssembler = VectorAssembler(inputCols = [col for col in df.columns if '_id' not in col and 'label' not in col], outputCol="features")
features_vectorized = vectorAssembler.transform(df)

# broadcast the fitted model to the executors (sc is the active SparkContext)
model_broadcast = sc.broadcast(model)
# udf to predict on the cluster
def predict_new(feature_map):
    ids, features = zip(*[
        (k,  v) for d in feature_map for k, v in d.items()
    ])
    ind = model_broadcast.value.classes_.tolist().index(1.0)
    probs = [
        float(v) for v in 
        model_broadcast.value.predict_proba(features)[:, ind]
    ]
    return dict(zip(ids, probs))
predict_new_udf = f.udf(
    predict_new,
    t.MapType(t.LongType(), t.FloatType()),
)
# set the number of prediction groups to create
nparts = 5000
# put everything together
outcome_sdf = (
    features_vectorized.select(
        f.create_map(f.col('id'), f.col('features')).alias('feature_map'),
        (f.row_number().over(w.partitionBy(f.lit(1)).orderBy(f.lit(1))) % nparts).alias('grouper'),
    )
    .groupby(f.col('grouper'))
    .agg(f.collect_list(f.col('feature_map')).alias('feature_map'))
    .select(predict_new_udf(f.col('feature_map')).alias('results'))
    .select(f.explode(f.col('results')).alias('unique_id', 'probability_estimate'))
)

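This runs and executes fine, but when I look at the values of outcome_sdf, I get the error "No module named 'sklearn'". I have read about installing sklearn on the cluster. Can anyone help me with that?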

Answer 1

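You need to install sklearn on every node of the cluster, not just on a single node. The UDF runs on the executors, and unpickling the broadcast model there has to import sklearn, so the package must be available in the Python environment of each worker. The error only shows up when you inspect outcome_sdf because Spark evaluates transformations lazily: nothing runs on the workers until an action is triggered.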
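
To see which nodes are actually missing the package, one option is to test importability from inside a Spark task. The following is only a minimal sketch: `check_module` and `check_partition` are hypothetical helper names made up for this illustration, not part of PySpark or scikit-learn, and the commented-out cluster call assumes `sc` is the active SparkContext as in the question.

```python
import importlib.util
import socket


def check_module(module_name):
    """Report whether module_name is importable by the current interpreter."""
    spec = importlib.util.find_spec(module_name)
    return {
        "host": socket.gethostname(),
        "module": module_name,
        "available": spec is not None,
    }


def check_partition(_rows, module_name="sklearn"):
    """Partition-level wrapper: ignores the rows and yields one report."""
    yield check_module(module_name)


# Local check on the driver.
print(check_module("sklearn"))

# On a cluster (assumption: `sc` is the active SparkContext):
# reports = sc.parallelize(range(100), 20).mapPartitions(check_partition).collect()
# missing = {r["host"] for r in reports if not r["available"]}
```

Spark does not guarantee that the tasks land on every node, so an empty `missing` set is a hint rather than proof. Any host that does appear in it needs the package installed in the Python interpreter the executors use (the pip package name is `scikit-learn`, while the import name is `sklearn`).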
