
Multiclass classification in Spark with Term Frequency

I am quite new to Apache Spark and MLlib and am trying to build my first multiclass classification model. I got stuck at some point. Here is my code:

val input = sc.textFile("cars2.csv").map(line => line.split(";").toSeq)

Creating the DataFrame:

import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val sql = new SQLContext(sc)
val schema = StructType(List(StructField("Description", StringType), StructField("Brand", StringType), StructField("Fuel", StringType)))
val dataframe = sql.createDataFrame(input.map(row => Row(row(0), row(1), row(2))), schema)

My DataFrame looks like this:

+-----------------+----------+------+
|      Description|     Brand|  Fuel|
+-----------------+----------+------+
|  giulietta 1.4TB|alfa romeo|PETROL|
|               4c|alfa romeo|PETROL|
| giulietta 2.0JTD|alfa romeo|DIESEL|
|   Mito 1.4 Tjet |alfa romeo|PETROL|
|     a1 1.4  TFSI|      AUDI|PETROL|
|      a1 1.0 TFSI|      AUDi|PETROL|
|      a3 1.4 TFSI|      AUDI|PETROL|
|      a3 1.2 TFSI|      AUDI|PETROL|
|       a3 2.0 Tdi|      AUDI|DIESEL|
|       a3 1.6 TDi|      AUDI|DIESEL|
|        a3 1.8tsi|      AUDI|PETROL|
|             RS3 |      AUDI|PETROL|
|               S3|      AUDI|PETROL|
|        A4 2.0TDI|      AUDI|DIESEL|
|        A4 2.0TDI|      AUDI|DIESEL|
|      A4 1.4 tfsi|      AUDI|PETROL|
|       A4 2.0TFSI|      AUDI|PETROL|
|        A4 3.0TDI|      AUDI|DIESEL|
|          X5 3.0D|       BMW|DIESEL|
|             750I|       BMW|PETROL|
+-----------------+----------+------+

Then:

import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}

// Tokenize the Description column
val tokenizer = new Tokenizer().setInputCol("Description").setOutputCol("tokens")
val tokenized = tokenizer.transform(dataframe)

// Create term frequencies
val htf = new HashingTF().setInputCol(tokenizer.getOutputCol).setOutputCol("rawFeatures").setNumFeatures(500)
val tf = htf.transform(tokenized)

// Rescale the term frequencies with IDF
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")


// Model & Pipeline
import org.apache.spark.ml.classification.LogisticRegression
val lr = new LogisticRegression().setMaxIter(20).setRegParam(0.01)

import org.apache.spark.ml.Pipeline
val pipeline = new Pipeline().setStages(Array(tokenizer, idf, lr))
//Model
val model = pipeline.fit(dataframe)

Error:

java.lang.IllegalArgumentException: Field "rawFeatures" does not exist.

I am trying to predict Brand and Fuel type from the Description alone.

Thanks in advance.

Two small issues with your code:

  1. The htf variable isn't used; I assume it's missing from the pipeline? Since this is the PipelineStage that creates the rawFeatures field required by the next stage (idf), you get the "Field does not exist" error.

  2. Even if we fix this, the last stage (LogisticRegression) will fail because it requires a label field of type DoubleType, in addition to the features field. You'll need to add such a field to your dataframe before fitting.

Changing the last lines of your code ...

import org.apache.spark.sql.functions.lit

// Pipeline, with the "htf" stage added
val pipeline = new Pipeline().setStages(Array(tokenizer, htf, idf, lr))
// Model, with an additional (constant, placeholder) label field
val model = pipeline.fit(dataframe.withColumn("label", lit(1.0)))

... will make this finish successfully, but of course the labeling here is just for the example's sake; create the labels as you see fit.
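For real labels instead of a constant, one option is to derive the label column from Brand with a StringIndexer stage. The following is only a sketch and not part of the original code. It assumes Spark 2.1 or later (where spark.ml's LogisticRegression handles more than two classes), a local SparkSession, and a few hand-copied rows in place of cars2.csv. The names spark, cars, and brandIndexer are illustrative. Brand is upper-cased first because values such as "AUDI" and "AUDi" would otherwise be indexed as two different classes.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, IDF, StringIndexer, Tokenizer}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.upper

// Assumption: Spark 2.1+ running locally; adjust the master for a real cluster.
val spark = SparkSession.builder().appName("brand-from-description").master("local[*]").getOrCreate()
import spark.implicits._

// A few rows copied from the question's table, standing in for cars2.csv.
val cars = Seq(
  ("giulietta 1.4TB", "alfa romeo", "PETROL"),
  ("giulietta 2.0JTD", "alfa romeo", "DIESEL"),
  ("a3 1.4 TFSI", "AUDI", "PETROL"),
  ("a3 2.0 Tdi", "AUDi", "DIESEL"),
  ("X5 3.0D", "BMW", "DIESEL"),
  ("750I", "BMW", "PETROL")
).toDF("Description", "Brand", "Fuel")
  .withColumn("Brand", upper($"Brand")) // normalize casing before indexing

// StringIndexer maps each distinct Brand string to a Double index in a "label" column.
val brandIndexer = new StringIndexer().setInputCol("Brand").setOutputCol("label")
val tokenizer = new Tokenizer().setInputCol("Description").setOutputCol("tokens")
val htf = new HashingTF().setInputCol("tokens").setOutputCol("rawFeatures").setNumFeatures(500)
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(20).setRegParam(0.01)

// Every stage that produces a column needed later must be in the pipeline.
val pipeline = new Pipeline().setStages(Array(brandIndexer, tokenizer, htf, idf, lr))
val model = pipeline.fit(cars)

// Predicting on the training rows only to show the output columns.
val predictions = model.transform(cars)
predictions.select("Description", "Brand", "label", "prediction").show(false)
```

Predicting Fuel as well would need a second pipeline of the same shape, with the indexer's input column set to "Fuel", since each LogisticRegression stage learns a single label column.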
