
Compute an average in an RDD and then filter this RDD based on the average in Spark Streaming

I want to do something in Spark Streaming that feels a bit strange to me, and I would like some feedback on it.

I have a DStream of (String, Int) tuples. Let's say the string is an id and the integer a value.

So for each micro-batch, I want to compute the average of the Int field and, based on this average, filter that same micro-batch, for example keeping only the elements whose value is greater than the average. So I wrote this code:

lineStreams
  .foreachRDD { rdd =>
    val totalElement = rdd.count()
    if (totalElement > 0) {
      val totalSum = rdd.map(_._2).reduce(_ + _)
      // Convert to Double to avoid integer division
      val average = totalSum.toDouble / totalElement
      rdd.foreach { elem =>
        if (elem._2 > average) {
          // Note: this runs on the executors, so the output goes
          // to the executor logs, not the driver console
          println("Element is higher than average")
        }
      }
    }
  }

But this code does not actually work: the first part of the computation looks fine, but the test does not. I know there are some dirty things in this code, but I just want to know whether the logic is sound.

Thanks for your advice!

Try:

lineStreams.transform { rdd =>
  // transform runs this function once per micro-batch RDD and
  // returns a new DStream made of the filtered elements
  val mean = rdd.values.map(_.toDouble).mean()
  rdd.filter(_._2.toDouble > mean)
}
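The core logic (compute the batch mean, then keep only the elements above it) can be checked without a Spark cluster. Below is a minimal, Spark-free sketch of that same logic on a plain Scala collection; the object and method names are illustrative, not part of any Spark API:

```scala
// Sketch of the mean-then-filter logic on an in-memory batch.
object MeanFilter {
  def filterAboveMean(pairs: Seq[(String, Int)]): Seq[(String, Int)] = {
    if (pairs.isEmpty) Seq.empty
    else {
      // Double mean, mirroring rdd.values.map(_.toDouble).mean()
      val mean = pairs.map(_._2).sum.toDouble / pairs.size
      // Mirrors rdd.filter(_._2.toDouble > mean)
      pairs.filter(_._2 > mean)
    }
  }

  def main(args: Array[String]): Unit = {
    val batch = Seq(("a", 1), ("b", 2), ("c", 9))
    // mean = 4.0, so only ("c", 9) survives
    println(MeanFilter.filterAboveMean(batch))
  }
}
```

With `transform`, Spark Streaming applies exactly this kind of per-batch function to each micro-batch RDD, so the mean is recomputed independently for every batch.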
