Multiclass classification in Spark with Term Frequency

Question

I am quite new to Apache Spark and MLlib and am trying to build my first multiclass classification model. I am stuck at one point. Here is my code:

val input = sc.textFile("cars2.csv").map(line => line.split(";").toSeq)

Creating the DataFrame:

import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val sql = new SQLContext(sc)
val schema = StructType(List(StructField("Description", StringType), StructField("Brand", StringType), StructField("Fuel", StringType)))
val dataframe = sql.createDataFrame(input.map(row => Row(row(0), row(1), row(2))), schema)

My DataFrame looks like this:

+-----------------+----------+------+
|      Description|     Brand|  Fuel|
+-----------------+----------+------+
|  giulietta 1.4TB|alfa romeo|PETROL|
|               4c|alfa romeo|PETROL|
| giulietta 2.0JTD|alfa romeo|DIESEL|
|   Mito 1.4 Tjet |alfa romeo|PETROL|
|     a1 1.4  TFSI|      AUDI|PETROL|
|      a1 1.0 TFSI|      AUDi|PETROL|
|      a3 1.4 TFSI|      AUDI|PETROL|
|      a3 1.2 TFSI|      AUDI|PETROL|
|       a3 2.0 Tdi|      AUDI|DIESEL|
|       a3 1.6 TDi|      AUDI|DIESEL|
|        a3 1.8tsi|      AUDI|PETROL|
|             RS3 |      AUDI|PETROL|
|               S3|      AUDI|PETROL|
|        A4 2.0TDI|      AUDI|DIESEL|
|        A4 2.0TDI|      AUDI|DIESEL|
|      A4 1.4 tfsi|      AUDI|PETROL|
|       A4 2.0TFSI|      AUDI|PETROL|
|        A4 3.0TDI|      AUDI|DIESEL|
|          X5 3.0D|       BMW|DIESEL|
|             750I|       BMW|PETROL|
+-----------------+----------+------+

Then:

import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}

// Tokenize
val tokenizer = new Tokenizer().setInputCol("Description").setOutputCol("tokens")
val tokenized = tokenizer.transform(dataframe)

// Create term frequencies
val htf = new HashingTF().setInputCol(tokenizer.getOutputCol).setOutputCol("rawFeatures").setNumFeatures(500)
val tf = htf.transform(tokenized)

// Inverse document frequency
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")

// Model & Pipeline
import org.apache.spark.ml.classification.LogisticRegression
val lr = new LogisticRegression().setMaxIter(20).setRegParam(0.01)

import org.apache.spark.ml.Pipeline
val pipeline = new Pipeline().setStages(Array(tokenizer, idf, lr))
//Model
val model = pipeline.fit(dataframe)

Error:

java.lang.IllegalArgumentException: Field "rawFeatures" does not exist.

I am trying to predict the Brand and the Fuel type from the Description alone.

Thanks in advance

Answer 1

Two small issues with your code:

1. The htf variable isn't used; I assume it is missing from the pipeline. Since it is the PipelineStage that creates the rawFeatures field required by the next stage (idf), you get the "Field does not exist" error.
2. Even once that is fixed, the last stage (LogisticRegression) will fail because, in addition to the features field, it requires a label field of type DoubleType. You'll need to add such a field to your DataFrame before fitting.

Changing the last lines of your code ...

import org.apache.spark.sql.functions.lit

// Pipeline, with the "htf" stage added
val pipeline = new Pipeline().setStages(Array(tokenizer, htf, idf, lr))
// Model, with an additional (constant) label field
val model = pipeline.fit(dataframe.withColumn("label", lit(1.0)))

... will make this finish successfully. Of course, the constant label here is just for the example's sake; create the labels as you see fit.
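
One possible way to build real labels instead of the constant lit(1.0) is Spark ML's StringIndexer, which maps each distinct string in a column to a Double index. The following is only a sketch under some assumptions: it targets Spark 2.1 or later (where ml.LogisticRegression accepts more than two classes, which earlier releases did not), it uses a local SparkSession rather than the SQLContext from the question, and the names cars, cleaned, brandIndexer, brandPipeline and brandModel are illustrative. The sample rows are copied from the question's table. Lower-casing Brand is an extra step I added so that "AUDi" and "AUDI" end up in the same class.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, IDF, StringIndexer, Tokenizer}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lower, trim}

// Assumption: Spark 2.1+, so LogisticRegression can handle three or more classes.
val spark = SparkSession.builder().master("local[*]").appName("cars").getOrCreate()
import spark.implicits._

// A few rows copied from the question's DataFrame.
val cars = Seq(
  ("giulietta 1.4TB", "alfa romeo", "PETROL"),
  ("giulietta 2.0JTD", "alfa romeo", "DIESEL"),
  ("a1 1.0 TFSI", "AUDi", "PETROL"),
  ("a3 2.0 Tdi", "AUDI", "DIESEL"),
  ("X5 3.0D", "BMW", "DIESEL"),
  ("750I", "BMW", "PETROL")
).toDF("Description", "Brand", "Fuel")

// Normalize case and whitespace so "AUDi" and "AUDI" become one class.
val cleaned = cars.withColumn("Brand", lower(trim(col("Brand"))))

// StringIndexer turns each distinct Brand string into a Double index in "label".
val brandIndexer = new StringIndexer().setInputCol("Brand").setOutputCol("label")

// Same feature stages as in the question, with htf included this time.
val tokenizer = new Tokenizer().setInputCol("Description").setOutputCol("tokens")
val htf = new HashingTF().setInputCol("tokens").setOutputCol("rawFeatures").setNumFeatures(500)
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(20).setRegParam(0.01)

val brandPipeline = new Pipeline().setStages(Array(brandIndexer, tokenizer, htf, idf, lr))
val brandModel = brandPipeline.fit(cleaned)

// Show the indexed label next to the predicted class index.
brandModel.transform(cleaned).select("Description", "Brand", "label", "prediction").show(false)
```

Since you want to predict both Brand and Fuel, one option is to fit a second pipeline of the same shape whose StringIndexer reads "Fuel" instead of "Brand", giving one model per target.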

solution1
0 2016-04-23 14:11:49
