
Apache Flink - Prediction Handling

I am currently using Apache Flink's SVM class to run predictions on some text data.

The class provides a predict function that takes a DataSet[Vector] as input and returns a DataSet[Prediction] as the result. So far so good.

My problem is that I lose the context: I can't tell which prediction belongs to which text, and I can't pass the text through the predict() function so that it is still available afterwards.

Code:

val tweets: DataSet[SparseVector] =
    source.flatMap(new SelectEnglishTweetWithCreatedAtFlatMapper)
        .map(tweet => featureVectorService.transform(tweet._2)) // the tweet text is dropped here

model.predict(tweets).print()


Result example:
(SparseVector((462,8.73165920153676), (10844,8.508515650222549), (15656,2.931052542245018)),-1.0)

Is there a way to keep other data next to the prediction so that everything stays together? Without context, the prediction does not help me.

Or maybe there is a way to predict a single vector instead of a DataSet, so that I could call the function inside the map function above.

The SVM predictor expects a subtype of Vector as input. Hence there are two options to solve this problem:

  1. Create a subtype of Vector that carries the tweet text as a tag. It is then passed through the predictor unchanged. This approach has the advantage that no additional operation is needed. However, one needs to define new classes and utilities to represent the different vector types with tags:
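import org.apache.flink.api.scala._
import org.apache.flink.ml.classification.SVM
import org.apache.flink.ml.math.DenseVector

val env = ExecutionEnvironment.getExecutionEnvironment

val input = env.fromElements("foobar", "barfo", "test")

// Toy featurization: a single feature holding the sum of the character codes
val vectorizedInput = input.map(word => {
  val value = word.chars().sum().toDouble
  new DenseVectorWithTag(Array(value), word)
})

val svm = SVM().setBlocks(env.getParallelism)

val weights = env.fromElements(DenseVector(1.0))

svm.weightsOption = Option(weights) // skipping the training here

// The tag survives because predict returns the input vector type
val predictionResult: DataSet[(DenseVectorWithTag, Double)] = svm.predict(vectorizedInput)

// A DenseVector that additionally carries the original text as a tag
class DenseVectorWithTag(override val data: Array[Double], tag: String)
  extends DenseVector(data) {
  override def toString: String = "(" + super.toString + ", " + tag + ")"
}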
  2. Join the prediction DataSet with the input DataSet on the vectorized representation of the tweets. This approach has the advantage that no new classes are needed. The price is an additional join operation, which might be expensive:
// env is the ExecutionEnvironment from the first snippet
val input = env.fromElements("foobar", "barfo", "test")

// Keep each word next to its vector so it can be joined back later
val vectorizedInput = input.map(word => {
  val value = word.chars().sum().toDouble
  (DenseVector(value), word)
})

val svm = SVM().setBlocks(env.getParallelism)

val weights = env.fromElements(DenseVector(1.0))

svm.weightsOption = Option(weights) // skipping the training here

// Predict on the vectors only, then join back on the vector (field 0 on both sides)
val predictionResult = svm.predict(vectorizedInput.map(a => a._1))
val inputWithPrediction: DataSet[(String, Double)] = vectorizedInput
  .join(predictionResult)
  .where(0)
  .equalTo(0)
  .apply((t, p) => (t._2, p._2)) // (text, predicted label)
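
On the second part of the question (predicting a single vector inside the map function): for a linear SVM, a prediction is just the sign of the dot product between the weight vector and the feature vector, so once the weights are available locally the text can be carried alongside each vector. The following is a minimal plain-Scala sketch of that idea with no Flink dependency. The names TaggedVector, dot and predict are hypothetical, and the decision rule (label 1.0 if the dot product exceeds a threshold of 0.0, otherwise -1.0) is an assumption that should be checked against the configuration of the SVM actually used.

```scala
// Plain-Scala sketch; all names below are hypothetical and not part of Flink.
object LocalPredictionSketch {
  // Sparse feature vector stored as index -> value, paired with its source text.
  final case class TaggedVector(features: Map[Int, Double], text: String)

  // Dot product of dense weights with a sparse vector; out-of-range indices are ignored.
  def dot(weights: Array[Double], features: Map[Int, Double]): Double =
    features.foldLeft(0.0) { case (acc, (i, v)) =>
      if (i >= 0 && i < weights.length) acc + weights(i) * v else acc
    }

  // Assumed decision rule: 1.0 if w.x exceeds the threshold, -1.0 otherwise.
  def predict(weights: Array[Double],
              features: Map[Int, Double],
              threshold: Double = 0.0): Double =
    if (dot(weights, features) > threshold) 1.0 else -1.0

  def main(args: Array[String]): Unit = {
    val weights = Array(0.5, -2.0, 1.0) // made-up weights, not a trained model
    val tweets = Seq(
      TaggedVector(Map(0 -> 1.0, 2 -> 3.0), "first tweet"),
      TaggedVector(Map(1 -> 4.0), "second tweet")
    )
    // The text stays next to its prediction because both are handled in one map.
    val labelled: Seq[(String, Double)] =
      tweets.map(t => (t.text, predict(weights, t.features)))
    labelled.foreach(println)
  }
}
```

In a Flink job the same map would need the trained weights inside the function, for example as a broadcast set. This bypasses predict() entirely, so it is only worthwhile if the join in option 2 turns out to be too expensive.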
