Multiclass classification in Spark with Term Frequency
I am quite new to Apache Spark and MLlib and am trying to build my first multiclass classification model. I got stuck at some point. Here is my code:
val input = sc.textFile("cars2.csv").map(line => line.split(";").toSeq)
Creating the DataFrame:
val sql = new SQLContext(sc)
val schema = StructType(List(StructField("Description", StringType), StructField("Brand", StringType), StructField("Fuel", StringType)))
val dataframe = sql.createDataFrame(input.map(row => Row(row(0), row(1), row(2))), schema)
My DataFrame looks like this:
+-----------------+----------+------+
|      Description|     Brand|  Fuel|
+-----------------+----------+------+
|  giulietta 1.4TB|alfa romeo|PETROL|
|               4c|alfa romeo|PETROL|
| giulietta 2.0JTD|alfa romeo|DIESEL|
|   Mito 1.4 Tjet |alfa romeo|PETROL|
|      a1 1.4 TFSI|      AUDI|PETROL|
|      a1 1.0 TFSI|      AUDi|PETROL|
|      a3 1.4 TFSI|      AUDI|PETROL|
|      a3 1.2 TFSI|      AUDI|PETROL|
|       a3 2.0 Tdi|      AUDI|DIESEL|
|       a3 1.6 TDi|      AUDI|DIESEL|
|        a3 1.8tsi|      AUDI|PETROL|
|             RS3 |      AUDI|PETROL|
|               S3|      AUDI|PETROL|
|        A4 2.0TDI|      AUDI|DIESEL|
|        A4 2.0TDI|      AUDI|DIESEL|
|      A4 1.4 tfsi|      AUDI|PETROL|
|       A4 2.0TFSI|      AUDI|PETROL|
|        A4 3.0TDI|      AUDI|DIESEL|
|          X5 3.0D|       BMW|DIESEL|
|             750I|       BMW|PETROL|
+-----------------+----------+------+
Then:
//Tokenize
val tokenizer = new Tokenizer().setInputCol("Description").setOutputCol("tokens")
val tokenized = tokenizer.transform(dataframe)
//Creating term-frequency
val htf = new HashingTF().setInputCol(tokenizer.getOutputCol).setOutputCol("rawFeatures").setNumFeatures(500)
val tf = htf.transform(tokenized)
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
// Model & Pipeline
import org.apache.spark.ml.classification.LogisticRegression
val lr = new LogisticRegression().setMaxIter(20).setRegParam(0.01)
import org.apache.spark.ml.Pipeline
val pipeline = new Pipeline().setStages(Array(tokenizer, idf, lr))
//Model
val model = pipeline.fit(dataframe)
Error:
java.lang.IllegalArgumentException: Field "rawFeatures" does not exist.
I am trying to predict Brand and Fuel type by reading only the Description.
Thanks in advance.
Two small issues with your code:
1. The htf variable isn't used; I assume it's missing from the pipeline? Since this is the PipelineStage that creates the rawFeatures field required by the next stage (idf), you get the Field "rawFeatures" does not exist error.
2. Even if we fix this, the last stage (LogisticRegression) will fail because it requires a label field of type DoubleType, in addition to the features field. You'll need to add such a field to your dataframe before fitting.
Changing the last lines of your code ...
// lit(...) builds the constant label column used below
import org.apache.spark.sql.functions.lit
// Pipeline - with the "htf" stage added
val pipeline = new Pipeline().setStages(Array(tokenizer, htf, idf, lr))
// Model - with an additional (constant...) label field
val model = pipeline.fit(dataframe.withColumn("label", lit(1.0)))
... will make this finish successfully, but of course the labeling here is just for the example's sake; create the labels as you see fit.
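One possible way to get real labels instead of a constant is to let StringIndexer turn the Brand strings into a double-valued label column as the first pipeline stage. The sketch below is only an illustration under a few assumptions that are not in the question: it assumes Spark 2.1 or later (for SparkSession and multinomial logistic regression in spark.ml), it uses a handful of rows copied from the table above in place of cars2.csv, and the names labelIndexer and predictions are made up for the example.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, IDF, StringIndexer, Tokenizer}
import org.apache.spark.sql.SparkSession

// Assumption: Spark 2.1+; in spark-shell, getOrCreate() simply returns the existing session
val spark = SparkSession.builder().appName("brand-from-description").master("local[*]").getOrCreate()
import spark.implicits._

// A few rows taken from the question's table, standing in for cars2.csv
val dataframe = Seq(
  ("giulietta 1.4TB", "alfa romeo", "PETROL"),
  ("giulietta 2.0JTD", "alfa romeo", "DIESEL"),
  ("a3 1.4 TFSI", "AUDI", "PETROL"),
  ("a3 2.0 Tdi", "AUDI", "DIESEL"),
  ("X5 3.0D", "BMW", "DIESEL"),
  ("750I", "BMW", "PETROL")
).toDF("Description", "Brand", "Fuel")

// Map each distinct Brand string to a double index stored in "label"
val labelIndexer = new StringIndexer().setInputCol("Brand").setOutputCol("label")

// Same feature stages as in the question, with htf included
val tokenizer = new Tokenizer().setInputCol("Description").setOutputCol("tokens")
val htf = new HashingTF().setInputCol("tokens").setOutputCol("rawFeatures").setNumFeatures(500)
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(20).setRegParam(0.01)

// The label indexer runs first, so "label" exists by the time lr is fitted
val pipeline = new Pipeline().setStages(Array(labelIndexer, tokenizer, htf, idf, lr))
val model = pipeline.fit(dataframe)

val predictions = model.transform(dataframe)
predictions.select("Description", "Brand", "label", "prediction").show(false)
```

To predict Fuel as well, the same pipeline could be fitted a second time with the StringIndexer pointed at the Fuel column, which gives one model per target. Note that the question's data contains both AUDI and AUDi, which StringIndexer would treat as two different classes unless the column is normalized first.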