
Apache Flink - SVM predictions on streaming data

I am using Apache Flink to run predictions on a stream of tweets from Twitter.

The code is written in Scala.

My problem is that the SVM model I trained with the DataSet API needs a DataSet as input to its predict() method.

I have already seen a question here where a user said that you need to write your own MapFunction which reads the model when the job starts (ref: Real-Time streaming prediction in Flink using scala).

But I am not able to write or understand this code.

Even if I get the model inside the streaming MapFunction, I still need a DataSet as a parameter to predict the result.

I really hope someone can show or explain to me how this is done.

Flink version: 1.9, Scala version: 2.11, Flink-ML: 2.11

import org.apache.flink.api.scala._
import org.apache.flink.ml.MLUtils
import org.apache.flink.ml.classification.SVM
import org.apache.flink.ml.common.LabeledVector
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.streaming.connectors.twitter.TwitterSource

val strEnv = StreamExecutionEnvironment.getExecutionEnvironment
val env = ExecutionEnvironment.getExecutionEnvironment

// this is my model, including all the terms needed to calculate the TF-IDF values and to create a libsvm file
val featureVectorService = new FeatureVectorService
featureVectorService.learnTrainingData(labeledData, false)

// reads the created libsvm file
val trainingData: DataSet[LabeledVector] = MLUtils.readLibSVM(env, "...")
val svm = SVM()
  .setBlocks(env.getParallelism)
  .setIterations(100)
  .setRegularization(0.001)
  .setStepsize(0.1)
  .setSeed(42)

// learning
svm.fit(trainingData)

// this is my Twitter stream - the text should be predicted later
val streamSource: DataStream[String] = strEnv.addSource(new TwitterSource(params.getProperties))

// the texts I want to transform to TF-IDF vectors using the service above and then pass to the SVM for prediction
val tweets: DataStream[(String, String)] = streamSource
  .flatMap(new SelectEnglishTweetWithCreatedAtFlatMapper)

Currently FlinkML, which SVM is part of, does not support the streaming API. That is why SVM accepts only a DataSet. The idea is not to use FlinkML for the prediction, but rather some SVM library available in Scala or Java. Then you can read the model, for example from a file. The downside is that you have to implement most of the logic yourself.
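To give an idea of how much logic "yourself" means here, the following is a minimal sketch in plain Scala, not a definitive implementation. It assumes the SVM is linear, so that a prediction is just the sign of the dot product between a weight vector and the feature vector, and that the weights were exported from the batch job beforehand (the fitted FlinkML SVM appears to expose them as `weightsOption`, but verify this against your version). The `LinearSvmModel` class, the one-weight-per-line file format and the 0-based sparse `Map[Int, Double]` feature representation are all hypothetical choices made for illustration.

```scala
import scala.io.Source

// Hypothetical model holder: the weight vector of an already trained linear SVM.
final case class LinearSvmModel(weights: Array[Double], threshold: Double = 0.0) {

  // Raw decision value w · x for a sparse feature vector (0-based index -> value).
  // Indices outside the weight vector are ignored.
  def decision(features: Map[Int, Double]): Double =
    features.iterator.collect {
      case (i, v) if i >= 0 && i < weights.length => weights(i) * v
    }.sum

  // Label +1.0 or -1.0, depending on which side of the hyperplane the vector lies.
  def predict(features: Map[Int, Double]): Double =
    if (decision(features) > threshold) 1.0 else -1.0
}

object LinearSvmModel {
  // Assumed format: one weight per line, line k holding the weight of feature k.
  def fromLines(lines: Iterator[String]): LinearSvmModel =
    LinearSvmModel(lines.map(_.trim).filter(_.nonEmpty).map(_.toDouble).toArray)

  // Reads the weights from a local file and closes it afterwards.
  def fromFile(path: String): LinearSvmModel = {
    val src = Source.fromFile(path)
    try fromLines(src.getLines())
    finally src.close()
  }
}
```

Nothing in this sketch needs a DataSet: once the weights are in memory, each incoming feature vector can be scored on its own.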

The comment in the post you mentioned says more or less the same thing.
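A MapFunction that "reads the model upon start of the job" could look roughly like the sketch below. It is only an illustration under several assumptions: the model is a linear SVM whose weights were written to a local file with one weight per line, the tweets have already been turned into sparse feature vectors (the TF-IDF step is left out because the API of `FeatureVectorService` is not shown), and the class name `SvmPredictMapper` and its input and output tuple types are hypothetical.

```scala
import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.configuration.Configuration
import scala.io.Source

// Hypothetical streaming predictor: (createdAt, featureVector) in, (createdAt, label) out.
class SvmPredictMapper(modelPath: String)
    extends RichMapFunction[(String, Map[Int, Double]), (String, Double)] {

  // Loaded once per parallel task in open(), not shipped with the serialized function.
  @transient private var weights: Array[Double] = _

  // Called once before any element is processed: read the weight file here.
  override def open(parameters: Configuration): Unit = {
    val src = Source.fromFile(modelPath)
    try {
      weights = src.getLines().map(_.trim).filter(_.nonEmpty).map(_.toDouble).toArray
    } finally {
      src.close()
    }
  }

  // Called per element: score a single feature vector, no DataSet involved.
  override def map(in: (String, Map[Int, Double])): (String, Double) = {
    val (createdAt, features) = in
    val raw = features.iterator.collect {
      case (i, v) if i >= 0 && i < weights.length => weights(i) * v
    }.sum
    (createdAt, if (raw > 0.0) 1.0 else -1.0)
  }
}
```

On a `DataStream[(String, Map[Int, Double])]` this would be applied with `.map(new SvmPredictMapper(path))`, where `path` must be readable on every task manager. Because `open()` runs once per parallel instance, the file is read only at start-up and each tweet is then scored from memory.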


 