ML Pipeline for Spark Scala
I have a DataFrame (df) with the following structure:
Data
label pa_age pa_gender_category
10000 32.0 male
25000 36.0 female
45000 68.0 female
15000 24.0 male
Objective
I want to build a RandomForest classifier for the column 'label', with the columns 'pa_age' and 'pa_gender_category' as the features.
Process Followed
// Transform the labels column into labels index
val labelIndexer = new StringIndexer().setInputCol("label")
.setOutputCol("indexedLabel").fit(df)
// Transform column gender_category into labels
val featureTransformer = new StringIndexer().setInputCol("pa_gender_category")
.setOutputCol("pa_gender_category_label").fit(df)
// Convert indexed labels back to original labels.
val labelConverter = new IndexToString()
.setInputCol("prediction")
.setOutputCol("predictedLabel")
.setLabels(labelIndexer.labels)
// Train a RandomForest model.
val rf = new RandomForestClassifier()
.setLabelCol("indexedLabel")
.setFeaturesCol("indexedFeatures")
.setNumTrees(10)
Expected output from the above steps:
label pa_age pa_gender_category indexedLabel pa_gender_category_label
10000 32.0 male 1.0 1.0
25000 36.0 female 2.0 2.0
45000 68.0 female 3.0 2.0
10000 24.0 male 1.0 1.0
Now I need the data in 'label' and 'features' format:
val featureCreater = new VectorAssembler().setInputCols(Array("pa_age", "pa_gender_category"))
.setOutputCol("features").fit(df)
Pipeline
val pipeline = new Pipeline().setStages(Array(labelIndexer, featureTransformer,
featureCreater, rf, labelConverter))
Problem
error: value fit is not a member of org.apache.spark.ml.feature.VectorAssembler
val featureCreater = new VectorAssembler().setInputCols(Array("pa_age", "pa_gender_category_label")).setOutputCol("features").fit(df)
Basically, the step that converts the data into label and features format is where I am having trouble.
Is my process/pipeline correct here?
The problem is here:
val featureCreater = new VectorAssembler().setInputCols(Array("pa_age", "pa_gender_category"))
.setOutputCol("features").fit(df)
You cannot call fit(df) here, because VectorAssembler does not have a fit method: it is a Transformer, not an Estimator. The same is true of IndexToString. You can also remove .fit(df) from the StringIndexer stages, since the pipeline fits every estimator stage for you. After initializing the pipeline, call the fit method on the pipeline object:
val model = pipeline.fit(df)
The pipeline then runs through each stage you provided, in order.
Note that an unfitted StringIndexer does not have a labels property (only the fitted StringIndexerModel returned by fit does), so use getOutputCol instead.
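Putting the points above together, here is a minimal sketch of a corrected pipeline, not a definitive implementation. It rests on several assumptions that are not in the original post: a local SparkSession, the four sample rows from the question rebuilt inline, and the hypothetical names RfPipelineSketch, buildPipeline and genderIndexer. It keeps .fit(df) on the label indexer only, so that labels is available for IndexToString. It assembles the indexed gender column rather than the raw string column, because VectorAssembler does not accept string inputs. It also points the classifier at "features", since no "indexedFeatures" column is ever created. On Spark 3.x, labels is deprecated in favor of labelsArray(0).

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorAssembler}
import org.apache.spark.sql.{DataFrame, SparkSession}

object RfPipelineSketch {

  // Builds the (unfitted) pipeline; `df` is only needed to fit the label indexer.
  def buildPipeline(df: DataFrame): Pipeline = {
    // Fitted up front so that labelIndexer.labels can be passed to IndexToString.
    val labelIndexer = new StringIndexer()
      .setInputCol("label")
      .setOutputCol("indexedLabel")
      .fit(df)

    // Left unfitted: Pipeline.fit will fit this estimator itself.
    val genderIndexer = new StringIndexer()
      .setInputCol("pa_gender_category")
      .setOutputCol("pa_gender_category_label")

    // A Transformer, so there is no fit call; it takes the indexed (numeric) gender column.
    val assembler = new VectorAssembler()
      .setInputCols(Array("pa_age", "pa_gender_category_label"))
      .setOutputCol("features")

    val rf = new RandomForestClassifier()
      .setLabelCol("indexedLabel")
      .setFeaturesCol("features") // must match the assembler's output column
      .setNumTrees(10)

    // Maps predicted indices back to the original label strings.
    val labelConverter = new IndexToString()
      .setInputCol("prediction")
      .setOutputCol("predictedLabel")
      .setLabels(labelIndexer.labels)

    new Pipeline().setStages(
      Array[PipelineStage](labelIndexer, genderIndexer, assembler, rf, labelConverter))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: local run for illustration
      .appName("rf-pipeline-sketch")
      .getOrCreate()
    import spark.implicits._

    // Sample rows copied from the question.
    val df = Seq(
      (10000, 32.0, "male"),
      (25000, 36.0, "female"),
      (45000, 68.0, "female"),
      (15000, 24.0, "male")
    ).toDF("label", "pa_age", "pa_gender_category")

    val model = buildPipeline(df).fit(df)
    model.transform(df)
      .select("label", "features", "indexedLabel", "predictedLabel")
      .show(truncate = false)

    spark.stop()
  }
}
```

Fitting the label indexer before building the pipeline is only one option. It is used here because IndexToString needs the label array at construction time, and a fitted StringIndexerModel can still be passed to the pipeline as an ordinary transformer stage.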