Spark MLlib classification with Scala

Question

I am new to Spark, so this question may be naive. I use MLlib for text classification. I have a set of labeled sentences, which I feed to a multinomial NaiveBayes classifier for training. I found an example for that.

My input is in this form:

Wed Dec 31 23:13:30 +0000 2014,1,spending new years eve,0

Wed Dec 31 23:14:37 +0000 2014,1,bold angel,0

Wed Dec 31 23:14:53 +0000 2014,1,loren good give,0

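import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.regression.LabeledPoint

// Hash each term into one of 2,000,000 feature buckets.
val htf = new HashingTF(2000000)
val parsedData = data.map { line =>
  val parts = line.split(',')
  // parts(1) is the label, parts(2) is the text
  LabeledPoint(parts(1).toDouble, htf.transform(parts(2).split(' ')))
}
val model = NaiveBayes.train(parsedData, lambda = 1.0, modelType = "multinomial")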

So I take the text, use the hash function to turn its terms into a feature vector, and pair that vector with the label (0 or 1). After training I want to predict the labels for an unlabeled dataset. This is where my actual questions begin.
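For readers unsure what the hashing step does, here is a minimal plain-Scala sketch of the hashing trick. It is an illustration only, not Spark's actual implementation: the object `HashingSketch` and its methods are hypothetical names, and `String.hashCode` stands in for whatever hash HashingTF uses internally. The point it shows is that hashing assigns each term a feature index, while the label comes from a separate column (parts(1) above).

```scala
object HashingSketch {
  // Map a term to a bucket in [0, numFeatures), even when the hash is negative.
  // Assumption: String.hashCode is used here purely for illustration.
  def bucket(term: String, numFeatures: Int): Int = {
    val raw = term.hashCode % numFeatures
    if (raw < 0) raw + numFeatures else raw
  }

  // Sparse term-frequency vector: bucket index -> number of terms that landed there.
  def termFrequencies(terms: Seq[String], numFeatures: Int): Map[Int, Double] =
    terms.groupBy(bucket(_, numFeatures)).map { case (i, ts) => i -> ts.size.toDouble }
}
```

Two different terms can land in the same bucket, which is why a large feature count such as 2000000 is commonly chosen.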

I do not have labels for these text documents, so I cannot create LabeledPoints. I tried to give "random" values (doubles) as labels like this (the unlabeled data are stored in a different structure; parts(7) is the text here):

val testing = sc.textFile("neutralSegment.txt")
val parsedData = testing.map { line =>
  val parts = line.split(',')
  htf.transform(parts(7).split(' '))
}
val predictionAndLabel = parsedData.map(p => (model.predict(p)))

How can I get the processed data back in its original form, including the labels? The classifier produces the labels, but the terms have been transformed to doubles. I just want to concatenate the original string with the label produced by the classifier. Given this input:

16800,Wed Dec 31 23:03:23 +0000 2014,null,DJVINCE1 on now till 8 with your New Year's Eve Countdown mix!!,0,neutral,null,djvince now till new year eve countdown mix

How can I map the produced label to this input in order to get an output like this:

16800,Wed Dec 31 23:03:23 +0000 2014,null,DJVINCE1 on now till 8 with your New Year's Eve Countdown mix!!,0,neutral,null,djvince now till new year eve countdown mix, label{0,1}

Answer 1

OK, it seems all I had to do was create tuples containing my original text and the feature Vector produced by the hash function:

val parsedData = testing.map { line =>
  val parts = line.split(',')
  val text = parts(7).split(' ')
  (line, htf.transform(text))
}

Then feed the vectors to the classifier and create a new tuple of the original text plus the prediction. Now I have a structure that contains both fields I want.

val predictionAndLabel2 = parsedData.map(p =>
  (p._1 , model.predict(p._2))
)
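To get the comma-separated output asked for in the question, each pair can be joined back into one string. Below is a minimal sketch in plain Scala, under stated assumptions: `LabelJoinSketch`, `labelLines`, and the `predict` parameter are hypothetical names, with `predict` standing in for `model.predict(htf.transform(...))` so that the example runs without Spark.

```scala
object LabelJoinSketch {
  // Hypothetical stand-in for model.predict(htf.transform(tokens)):
  // any function from tokens to a label works here.
  type Predict = Seq[String] => Double

  // Keep each original line next to its prediction, then append the label.
  def labelLines(lines: Seq[String], textColumn: Int, predict: Predict): Seq[String] =
    lines.map { line =>
      val parts = line.split(',')
      val tokens = parts(textColumn).split(' ').toSeq
      val label = predict(tokens)
      s"$line,${label.toInt}"
    }
}
```

On the RDD from the answer, the same idea would presumably be a single `map` over `predictionAndLabel2` that formats each `(line, label)` pair with string interpolation. Note that splitting on ',' assumes the tweet text itself contains no commas.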

Answer 1: 2 votes, accepted, 2016-09-20 14:56:29
