
Calculations on Spark RDD without using Iterations

I'm trying to implement MAP (Mean Average Precision). So far everything works, but I've reached the stage where I need to do the calculations on the RDD itself (without iterating; `rdd.collect()` isn't an option).

Here's the final generated RDD (actual and predicted ratings, along with an index) on which I'd like to do the calculations:

JavaPairRDD<Tuple2<Double, Double>, Long> actualAndPredictedSorted = actual.join(predictions).mapToPair(
        new PairFunction<Tuple2<Tuple2<Integer, Integer>, Tuple2<Double, Double>>, Double, Double>() {
            public Tuple2<Double, Double> call(Tuple2<Tuple2<Integer, Integer>, Tuple2<Double, Double>> t) {
                // Swap to (predicted, actual) so sortByKey orders by predicted rating
                return new Tuple2<Double, Double>(t._2._2, t._2._1);
            }
        }).sortByKey(false).zipWithIndex();

Below is an image explaining how the calculation is done: for example, an entry counts as a hit (shown in green) if the user's actual rating in the RDD is above 3 out of 5.
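Since the image isn't visible here, a minimal local sketch of the standard average-precision formulation may help: walk down the list ranked by predicted rating, count a rank as a hit when the actual rating exceeds the threshold, take the precision at each hit rank, and average over the hits. The class name, method name, and sample data below are hypothetical; the question's exact scheme may differ from this standard definition.

```java
import java.util.Arrays;
import java.util.List;

public class AveragePrecision {
    // Standard average precision over a ranked list of actual ratings:
    // a rank counts as a hit when the actual rating exceeds the threshold.
    static double averagePrecision(List<Double> actualSortedByPrediction, double threshold) {
        int hits = 0;
        double sum = 0.0;
        for (int rank = 1; rank <= actualSortedByPrediction.size(); rank++) {
            if (actualSortedByPrediction.get(rank - 1) > threshold) {
                hits++;
                sum += (double) hits / rank; // precision at this hit's rank
            }
        }
        return hits == 0 ? 0.0 : sum / hits;
    }

    public static void main(String[] args) {
        // Actual ratings ordered by descending predicted rating (made-up data)
        List<Double> actual = Arrays.asList(5.0, 2.0, 4.0, 1.0);
        // Hits at ranks 1 and 3: (1/1 + 2/3) / 2 = 5/6 ≈ 0.8333
        System.out.println(averagePrecision(actual, 3.0));
    }
}
```

MAP over all users is then just the mean of this per-user value, which maps naturally onto a per-key aggregation in Spark.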

[image omitted]

I hope that explains it!

You need filtering, not iterating.

It can be achieved by:

  1. Filtering (keeping only the ratings that meet the condition).
  2. Summing them.
  3. Dividing by the number of entries.
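The three steps above can be sketched in plain Java streams so it runs without a cluster; on the RDD, the same shape is a `filter(...)` followed by `count()` (or a `map` plus `reduce` for a sum), all of which are distributed operations and never pull data to the driver the way `collect()` does. The sample pairs and the 3-out-of-5 threshold below are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

public class PrecisionSketch {
    public static void main(String[] args) {
        // (predicted, actual) rating pairs, already sorted by predicted rating descending
        List<double[]> pairs = Arrays.asList(
                new double[]{4.8, 5.0},
                new double[]{4.5, 2.0},
                new double[]{4.1, 4.0},
                new double[]{3.9, 1.0});

        // 1. Filter: keep entries whose actual rating is above the hit threshold
        // 2. Count them (the "adding" step, since each hit contributes 1)
        long hits = pairs.stream().filter(p -> p[1] > 3.0).count();

        // 3. Divide by the total number of entries
        double precision = (double) hits / pairs.size();
        System.out.println(precision); // 2 hits out of 4 entries -> 0.5
    }
}
```

The equivalent on `actualAndPredictedSorted` would use `rdd.filter(...)` with the same predicate and `rdd.count()`, so the whole computation stays on the executors.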

