Spark MLlib classification using Scala

I am new to Spark, so this question may be basic. I use MLlib for text classification: I have a set of labeled sentences that I feed to a multinomial Naive Bayes classifier for training, following an example I found.

My input is in this form:

Wed Dec 31 23:13:30 +0000 2014,1,spending new years eve,0

Wed Dec 31 23:14:37 +0000 2014,1,bold angel,0

Wed Dec 31 23:14:53 +0000 2014,1,loren good give,0

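import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.regression.LabeledPoint
// data is assumed to be an RDD[String] holding the input lines shown above
val htf = new HashingTF(2000000)
val parsedData = data.map { line =>
  val parts = line.split(',')
  // parts(1) is the label, parts(2) the space-separated text
  LabeledPoint(parts(1).toDouble, htf.transform(parts(2).split(' ')))
}
val model = NaiveBayes.train(parsedData, lambda = 1.0, modelType = "multinomial")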

So I take the text and, using the hashing function, turn its terms into a feature vector that is paired with the label (0 or 1). After training, I want to predict the labels for an unlabeled dataset, and this is where my actual questions begin.
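To make the hashing step concrete, here is a minimal sketch in plain Scala with no Spark dependency. It is a simplified stand-in for the hashing trick, not Spark's actual implementation: the object name HashingSketch and the helpers indexOf and termFrequencies are hypothetical, and the hash function used (String.hashCode) is an assumption chosen only for illustration. It shows that each term is mapped to a bucket index, the counts per bucket form the features, and the label is read separately from the second column.

```scala
// Simplified stand-in for the hashing trick; NOT Spark's HashingTF implementation.
object HashingSketch {
  // Map a term to a non-negative bucket index in [0, numFeatures).
  def indexOf(term: String, numFeatures: Int): Int = {
    val raw = term.hashCode % numFeatures
    if (raw < 0) raw + numFeatures else raw
  }

  // Count how often each bucket index occurs in a tokenized document.
  def termFrequencies(tokens: Seq[String], numFeatures: Int): Map[Int, Double] =
    tokens
      .groupBy(indexOf(_, numFeatures))
      .map { case (index, hits) => index -> hits.size.toDouble }

  def main(args: Array[String]): Unit = {
    val line = "Wed Dec 31 23:13:30 +0000 2014,1,spending new years eve,0"
    val parts = line.split(',')
    val label = parts(1).toDouble // the label comes from the second column
    val features = termFrequencies(parts(2).split(' ').toSeq, 2000000)
    println(s"label=$label, non-zero features=${features.size}")
  }
}
```

Two different terms can land in the same bucket, which is why a large feature count such as 2000000 is used to keep collisions rare.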

I do not have labels for these text documents, so I cannot create LabeledPoints. I tried giving "random" double values as labels. The unlabeled data is stored in a different structure, where parts(7) is the text:

val testing = sc.textFile("neutralSegment.txt")
val parsedData = testing.map { line =>
  val parts = line.split(',')
  htf.transform(parts(7).split(' '))
}
// yields only the predicted labels; the original lines are no longer attached
val predictionAndLabel = parsedData.map(p => model.predict(p))

How can I get the processed data back in its original form together with the labels? The classifier produces the labels, but the terms have been transformed into numeric feature vectors. I just want to concatenate the original string with the label produced by the classifier. Given this input:

16800,Wed Dec 31 23:03:23 +0000 2014,null,DJVINCE1 on now till 8 with your New Year's Eve Countdown mix!!,0,neutral,null,djvince now till new year eve countdown mix

How can I map the produced label to this input to get output like this:

16800,Wed Dec 31 23:03:23 +0000 2014,null,DJVINCE1 on now till 8 with your New Year's Eve Countdown mix!!,0,neutral,null,djvince now till new year eve countdown mix, label{0,1}

OK, it turns out all I had to do was create tuples containing my original text and the feature Vector produced by the hash function:

val parsedData = testing.map { line =>
  val parts = line.split(',')
  val text = parts(7).split(' ')
  (line, htf.transform(text))
} 

Then feed the vectors to the classifier and build a new tuple of the original text plus the prediction. The resulting structure contains both fields I want.

val predictionAndLabel2 = parsedData.map(p =>
  (p._1 , model.predict(p._2))
)
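To get from these tuples to the output format asked for in the question, each pair can be joined into a single line. Below is a minimal sketch in plain Scala, using a Seq as a stand-in for the RDD so it runs without Spark. The object name AppendLabelSketch and the helper appendLabel are hypothetical, and the prediction value 0.0 is a placeholder, not a real classifier result.

```scala
object AppendLabelSketch {
  // Join the untouched input line and the predicted label into one CSV record.
  def appendLabel(line: String, prediction: Double): String =
    s"$line,${prediction.toInt}"

  def main(args: Array[String]): Unit = {
    val line = "16800,Wed Dec 31 23:03:23 +0000 2014,null,DJVINCE1 on now till 8 with your New Year's Eve Countdown mix!!,0,neutral,null,djvince now till new year eve countdown mix"
    // Placeholder prediction; a real value would come from model.predict.
    val predicted: Seq[(String, Double)] = Seq((line, 0.0))
    predicted.map { case (l, p) => appendLabel(l, p) }.foreach(println)
  }
}
```

Assuming predictionAndLabel2 is an RDD of (String, Double) pairs as above, the same map with appendLabel should work on it directly, and the result could then be written out with saveAsTextFile.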

