ML Pipeline for Spark Scala

Question

I have a dataframe (df) with the following structure:

Data

label pa_age pa_gender_category
10000 32.0   male
25000 36.0   female
45000 68.0   female
15000 24.0   male

Objective

I want to build a RandomForest classifier for the column 'label', where the columns 'pa_age' and 'pa_gender_category' are the features.

Process Followed

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorAssembler}

// Index the label column.
val labelIndexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")
  .fit(df)

// Index the pa_gender_category column.
val featureTransformer = new StringIndexer()
  .setInputCol("pa_gender_category")
  .setOutputCol("pa_gender_category_label")
  .fit(df)

// Convert indexed labels back to original labels.
val labelConverter = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("predictedLabel")
  .setLabels(labelIndexer.labels)

// Train a RandomForest model.
val rf = new RandomForestClassifier()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("indexedFeatures")
  .setNumTrees(10)

Expected output from the above steps:

label pa_age pa_gender_category indexedLabel pa_gender_category_label
10000 32.0   male               1.0          1.0
25000 36.0   female             2.0          2.0
45000 68.0   female             3.0          2.0
10000 24.0   male               1.0          1.0

Now I need the data in 'label' and 'features' format:

val featureCreater = new VectorAssembler()
  .setInputCols(Array("pa_age", "pa_gender_category"))
  .setOutputCol("features")
  .fit(df)

Pipeline

val pipeline = new Pipeline().setStages(
  Array(labelIndexer, featureTransformer, featureCreater, rf, labelConverter))

Problem

error: value fit is not a member of org.apache.spark.ml.feature.VectorAssembler
       val featureCreater = new VectorAssembler().setInputCols(Array("pa_age", "pa_gender_category_label")).setOutputCol("features").fit(df)

Basically, the step that converts the data into label and features format is where I am having trouble.
Is my process/pipeline correct here?

Answer 1

The problem is here:

val featureCreater = new VectorAssembler()
  .setInputCols(Array("pa_age", "pa_gender_category"))
  .setOutputCol("features")
  .fit(df)

You cannot call fit(df) here, because VectorAssembler has no fit method: it is a Transformer, not an Estimator. You can also drop .fit(df) from the StringIndexer stages, since the pipeline fits its estimators itself. IndexToString, like VectorAssembler, is a Transformer and must not be fitted either. After constructing the pipeline, call fit on the pipeline object:

val model = pipeline.fit(df)

The pipeline then goes through each stage you gave it, in order: estimators are fitted on the data and transformers are applied to it.
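As a minimal sketch of what a working version of the question's pipeline could look like, assuming the Spark 2.x ML API and a local SparkSession (the session setup, the in-memory copy of the sample data, and the names genderIndexer and assembler are illustrative, not from the original post). It makes three choices beyond removing .fit(df) from VectorAssembler: the assembler reads the indexed column pa_gender_category_label, because VectorAssembler does not accept string columns; the classifier reads "features", the column the assembler actually produces; and the label indexer alone is still fitted up front, as one way to make its labels available to IndexToString. On Spark 3.x, labelsArray(0) is the non-deprecated equivalent of labels.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorAssembler}
import org.apache.spark.sql.SparkSession

// Assumption: a local session for experimentation (spark-shell already provides `spark`).
val spark = SparkSession.builder().master("local[*]").appName("rf-pipeline-sketch").getOrCreate()
import spark.implicits._

// The sample rows from the question, rebuilt in memory.
val df = Seq(
  (10000, 32.0, "male"),
  (25000, 36.0, "female"),
  (45000, 68.0, "female"),
  (15000, 24.0, "male")
).toDF("label", "pa_age", "pa_gender_category")

// Fitted here only so that its labels can be passed to IndexToString below.
val labelIndexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")
  .fit(df)

// Left unfitted: the pipeline fits this estimator itself.
val genderIndexer = new StringIndexer()
  .setInputCol("pa_gender_category")
  .setOutputCol("pa_gender_category_label")

// A Transformer, so there is no fit call; it combines numeric columns into one vector.
val assembler = new VectorAssembler()
  .setInputCols(Array("pa_age", "pa_gender_category_label"))
  .setOutputCol("features")

val rf = new RandomForestClassifier()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("features")
  .setNumTrees(10)

// Maps predicted indices back to the original label strings.
val labelConverter = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("predictedLabel")
  .setLabels(labelIndexer.labels)

val pipeline = new Pipeline()
  .setStages(Array(labelIndexer, genderIndexer, assembler, rf, labelConverter))

// One fit call on the pipeline trains every estimator stage in order.
val model = pipeline.fit(df)
model.transform(df).select("label", "features", "predictedLabel").show()
```

A stage that is already a fitted model, such as labelIndexer here, is simply applied as a transformer when the pipeline is fitted, so mixing fitted and unfitted stages in one pipeline is allowed.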

Note that an unfitted StringIndexer has no labels property (only the fitted StringIndexerModel returned by fit has one), so once .fit(df) is removed, labelIndexer.labels no longer compiles; use getOutputCol instead.
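The difference between the unfitted estimator and the fitted model can be sketched as follows, assuming the Spark 2.x ML API and a local SparkSession (the session setup and the one-column sample data are illustrative). The explicit type annotations are there only to show which class each value has.

```scala
import org.apache.spark.ml.feature.{StringIndexer, StringIndexerModel}
import org.apache.spark.sql.SparkSession

// Assumption: a local session for experimentation (spark-shell already provides `spark`).
val spark = SparkSession.builder().master("local[*]").appName("labels-sketch").getOrCreate()
import spark.implicits._

val df = Seq("male", "female", "female", "male").toDF("pa_gender_category")

// The estimator only holds configuration; it has seen no data yet.
val indexer: StringIndexer = new StringIndexer()
  .setInputCol("pa_gender_category")
  .setOutputCol("pa_gender_category_label")
// indexer.labels  // would not compile: labels is not a member of StringIndexer

// Configuration such as the output column name is available without fitting.
println(indexer.getOutputCol)

// fit returns a StringIndexerModel, which knows the distinct values it found.
val indexerModel: StringIndexerModel = indexer.fit(df)
val labels: Array[String] = indexerModel.labels
println(labels.mkString(", "))
```

The position of each string in labels is the index the model assigns to it, which is why IndexToString needs this array to translate predictions back.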

Answer 1: accepted, 2017-04-27 08:07:11