
Calculations on Spark RDD without using Iterations

I'm trying to implement MAP (Mean Average Precision). So far everything works, but I've reached the stage where I need to do the calculations on the RDD itself (without iterating; `rdd.collect()` isn't an option).

Here's the final generated RDD (actual and predicted ratings, along with an index) on which I'd like to do the calculations:

JavaPairRDD<Tuple2<Double, Double>, Long> actualAndPredictedSorted = actual.join(predictions).mapToPair(
        new PairFunction<Tuple2<Tuple2<Integer, Integer>, Tuple2<Double, Double>>, Double, Double>() {
            public Tuple2<Double, Double> call(Tuple2<Tuple2<Integer, Integer>, Tuple2<Double, Double>> t) {
                // Swap to (predicted, actual) so sortByKey orders by predicted rating
                return new Tuple2<Double, Double>(t._2._2, t._2._1);
            }
        }).sortByKey(false).zipWithIndex();

Below is an image explaining how the calculation is done: for example, an entry counts as a hit (shown in green) if the user's actual rating in the RDD is above 3 out of 5.
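Since the image isn't visible here, a minimal local sketch of the standard average-precision formulation may help: walk down the list ranked by predicted rating, count a rank as a hit when the actual rating exceeds the threshold, take the precision at each hit rank, and average over the hits. The class name, method name, and sample data below are hypothetical; the question's exact scheme may differ from this standard definition.

```java
import java.util.Arrays;
import java.util.List;

public class AveragePrecision {
    // Standard average precision over a ranked list of actual ratings:
    // a rank counts as a hit when the actual rating exceeds the threshold.
    static double averagePrecision(List<Double> actualSortedByPrediction, double threshold) {
        int hits = 0;
        double sum = 0.0;
        for (int rank = 1; rank <= actualSortedByPrediction.size(); rank++) {
            if (actualSortedByPrediction.get(rank - 1) > threshold) {
                hits++;
                sum += (double) hits / rank; // precision at this hit's rank
            }
        }
        return hits == 0 ? 0.0 : sum / hits;
    }

    public static void main(String[] args) {
        // Actual ratings ordered by descending predicted rating (made-up data)
        List<Double> actual = Arrays.asList(5.0, 2.0, 4.0, 1.0);
        // Hits at ranks 1 and 3: (1/1 + 2/3) / 2 = 5/6 ≈ 0.8333
        System.out.println(averagePrecision(actual, 3.0));
    }
}
```

MAP over all users is then just the mean of this per-user value, which maps naturally onto a per-key aggregation in Spark.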

[image omitted]

I hope that explains it!

You need filtering, not iterating.

It can be achieved by:

  1. Filtering (keeping only the ratings that meet the condition).
  2. Summing them.
  3. Dividing by the number of entries.
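The three steps above can be sketched in plain Java streams so it runs without a cluster; on the RDD, the same shape is a `filter(...)` followed by `count()` (or a `map` plus `reduce` for a sum), all of which are distributed operations and never pull data to the driver the way `collect()` does. The sample pairs and the 3-out-of-5 threshold below are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

public class PrecisionSketch {
    public static void main(String[] args) {
        // (predicted, actual) rating pairs, already sorted by predicted rating descending
        List<double[]> pairs = Arrays.asList(
                new double[]{4.8, 5.0},
                new double[]{4.5, 2.0},
                new double[]{4.1, 4.0},
                new double[]{3.9, 1.0});

        // 1. Filter: keep entries whose actual rating is above the hit threshold
        // 2. Count them (the "adding" step, since each hit contributes 1)
        long hits = pairs.stream().filter(p -> p[1] > 3.0).count();

        // 3. Divide by the total number of entries
        double precision = (double) hits / pairs.size();
        System.out.println(precision); // 2 hits out of 4 entries -> 0.5
    }
}
```

The equivalent on `actualAndPredictedSorted` would use `rdd.filter(...)` with the same predicate and `rdd.count()`, so the whole computation stays on the executors.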

