I cannot create a DataFrame in streaming mode for online prediction in Apache Spark using Scala
I am new to Spark and I would like to write a streaming program. I need to predict the number of repetitions for each of my rows. Here is my raw data:
05:49:56.604899 00:00:00:00:00:02 > 00:00:00:00:00:03, ethertype IPv4 (0x0800), length 10202: 10.0.0.2.54880 > 10.0.0.3.5001: Flags [.], seq 3641977583:3641987719, ack 129899328, win 58, options [nop,nop,TS val 432623 ecr 432619], length 10136
05:49:56.604908 00:00:00:00:00:03 > 00:00:00:00:00:02, ethertype IPv4 (0x0800), length 66: 10.0.0.3.5001 > 10.0.0.2.54880: Flags [.], ack 10136, win 153, options [nop,nop,TS val 432623 ecr 432623], length 0
05:49:56.604900 00:00:00:00:00:02 > 00:00:00:00:00:03, ethertype IPv4 (0x0800), length 4410: 10.0.0.2.54880 > 10.0.0.3.5001: Flags [P.], seq 10136:14480, ack 1, win 58, options [nop,nop,TS val 432623 ecr 432619], length 4344
I wrote code that extracts the output I need, as shown below. (I need the number of repetitions over column1 and column2.)
Here is my code:
However, that code does not run in streaming mode. Because the train.csv file is generated continuously, I wrote a second version that is meant to run as a stream, but I got some errors. Here is my streaming code:
import org.apache.spark.SparkConf
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, StreamingLinearRegressionWithSGD}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.util.Try
/**
* Created by saeedtkh on 5/24/17.
*/
object Main_ML_with_Streaming {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("saeed_test").setMaster("local[*]")
    //val sc = new SparkContext(conf)
    val ssc = new StreamingContext(conf, Seconds(5))

    ///////////////////// start extracting the packet
    val customSchema = StructType(Array(
      StructField("column0", StringType, true),
      StructField("column1", StringType, true),
      StructField("column2", StringType, true)))

    val rdd = ssc.textFileStream("/Users/saeedtkh/Desktop/sharedsaeed/train.csv")
    val rowRdd = rdd.map(line => line.split(">")).map(array => {
      val first = Try(array(0).trim.split(" ")(0)) getOrElse ""
      val second = Try(array(1).trim.split(" ")(6)) getOrElse ""
      val third = Try(array(2).trim.split(" ")(0).replace(":", "")) getOrElse ""
      Row.fromSeq(Seq(first, second, third))
    })

    val dataFrame_trainingData = sqlContext.createDataFrame(rowRdd, customSchema)
    dataFrame_trainingData.groupBy("column1","column2").count().show()
    ///////////////////// end extracting the packet

    val testData = ssc.textFileStream("/Users/saeedtkh/Desktop/sharedsaeed/test.csv").map(LabeledPoint.parse)
    //////////////////// end training and testing

    val numFeatures = 3
    val model = new StreamingLinearRegressionWithSGD()
      .setInitialWeights(Vectors.zeros(numFeatures))

    model.trainOn(dataFrame_trainingData)
    model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()

    ssc.start()
    ssc.awaitTermination()
    print("Here is the anwser: *****########*********#########*******222")
  }
}
The problem is that I cannot create a DataFrame using sqlContext at this line of my code:
val dataFrame_trainingData = sqlContext.createDataFrame(rowRdd, customSchema)
Can anybody help me modify this code so that it works in a streaming way and predicts the repetition of each row, using linear regression or any other algorithm? Thanks a lot.
Update 1: According to the first answer, I added foreach, but the errors still exist:
First, it's important to note that ssc.textFileStream returns a DStream and not an RDD, so the variables you named rdd, rowRdd and testData are not really RDDs, but rather abstractions over a continuous sequence of RDDs. Therefore, you cannot pass these to createDataFrame, which expects RDDs.
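To make the distinction concrete, here is a minimal sketch that annotates the types involved and counts the repetitions of each (column1, column2) pair with DStream operations alone, without a DataFrame. The object name PairCountSketch, the helper parseLine and the directory path are placeholders introduced for illustration; the splitting logic is copied from the question. It also assumes the input arrives as new files in a directory, since textFileStream monitors a directory rather than a single file.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream
import scala.util.Try

object PairCountSketch {
  // Same splitting logic as the question: (timestamp, source, destination).
  // Fields that cannot be extracted fall back to the empty string.
  def parseLine(line: String): (String, String, String) = {
    val parts = line.split(">")
    val first = Try(parts(0).trim.split(" ")(0)).getOrElse("")
    val second = Try(parts(1).trim.split(" ")(6)).getOrElse("")
    val third = Try(parts(2).trim.split(" ")(0).replace(":", "")).getOrElse("")
    (first, second, third)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("pair-count-sketch").setMaster("local[*]")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Placeholder path: a directory that receives new files over time.
    val lines: DStream[String] = ssc.textFileStream("/path/to/train-dir")

    // map on a DStream yields another DStream, never an RDD.
    val parsed: DStream[(String, String, String)] = lines.map(parseLine)

    // Repetition count of each (column1, column2) pair within each batch.
    val counts: DStream[((String, String), Long)] =
      parsed.map { case (_, src, dst) => ((src, dst), 1L) }.reduceByKey(_ + _)

    counts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```

Note that these counts are per batch interval (5 seconds here), not totals over the whole stream.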
You can create a DataFrame out of each underlying RDD using DStream.foreachRDD, as described here:
rowRdd.foreachRDD { rdd =>
  val dataFrame_trainingData = sqlContext.createDataFrame(rdd, customSchema)
  // ...
}
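The snippet above still needs a sqlContext in scope, which the question's code never defines. One possible way to fill that gap, assuming Spark 2.x, is to obtain a SparkSession inside foreachRDD from the batch RDD's own SparkContext. The object name PerBatchDataFrameSketch and the method showCountsPerBatch are hypothetical names for this sketch; on Spark 1.x, SQLContext.getOrCreate(rdd.sparkContext) would play the same role.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.apache.spark.streaming.dstream.DStream

object PerBatchDataFrameSketch {
  // Same three string columns as the question's schema.
  val customSchema: StructType = StructType(Array(
    StructField("column0", StringType, nullable = true),
    StructField("column1", StringType, nullable = true),
    StructField("column2", StringType, nullable = true)))

  // Builds one DataFrame per batch and prints the pair counts for that batch.
  def showCountsPerBatch(rowStream: DStream[Row]): Unit =
    rowStream.foreachRDD { rdd =>
      // Assumption: Spark 2.x. getOrCreate reuses an existing session if one is active.
      val spark = SparkSession.builder
        .config(rdd.sparkContext.getConf)
        .getOrCreate()

      val batchDf = spark.createDataFrame(rdd, customSchema)
      batchDf.groupBy("column1", "column2").count().show()
    }
}
```

In the question's program this would be called as PerBatchDataFrameSketch.showCountsPerBatch(rowRdd) before ssc.start().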
However, you should notice that StreamingLinearRegressionWithSGD expects DStreams as inputs for trainOn and predictOnValues, so you can simply pass the original DStreams without converting them into DataFrames.
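As a rough sketch of what passing DStreams directly could look like: trainOn takes a DStream[LabeledPoint], and predictOnValues takes a DStream of (key, feature vector) pairs. The sketch below assumes that both input directories receive lines already written in LabeledPoint text form, for example (2.0,[0.1,0.2,0.3]); the raw packet fields in the question are strings and would first need to be turned into a numeric label and numeric features, which is not shown here. The object name and both paths are placeholders.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, StreamingLinearRegressionWithSGD}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream

object StreamingRegressionSketch {
  // Must match the length of the feature vector in every input line.
  val numFeatures = 3

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("streaming-regression-sketch").setMaster("local[*]")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Placeholder directories; textFileStream picks up files added to them.
    // Assumption: each line looks like "(label,[f1,f2,f3])".
    val trainingData: DStream[LabeledPoint] =
      ssc.textFileStream("/path/to/train-dir").map(LabeledPoint.parse).cache()
    val testData: DStream[LabeledPoint] =
      ssc.textFileStream("/path/to/test-dir").map(LabeledPoint.parse)

    val model = new StreamingLinearRegressionWithSGD()
      .setInitialWeights(Vectors.zeros(numFeatures))

    // Both calls take DStreams; no DataFrame is involved.
    model.trainOn(trainingData)
    model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

The model is updated on every training batch, and each test batch is scored with the weights available at that moment.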