
SPARK: Maintaining different variables for different partitions?

Let's say I have some data like:

A B Value
1 1    40
1 2     3
1 2     5
2 1     6
2 2    10

in a DataFrame (say 'df'), and I have partitioned it on both A and B as:

df.repartition($"A",$"B")

Now, let's say we are supposed to count, separately for each partition, the number of values that are divisible by 2 or by 5. It would be unreasonable to maintain as many variables as there are partitions. What is the most efficient way to go about this?

(Kindly offer solutions that are applicable in Spark 1.6+)

You can use the .mapPartitions transformation to run a computation on each partition independently. For example:

rdd.mapPartitions { partition =>
  // One local counter per partition; no shared state needed,
  // because each invocation sees only this partition's elements.
  var count = 0
  partition.foreach { v =>
    if (v % 2 == 0 || v % 5 == 0) count += 1
  }
  Iterator(count)  // emit one result per partition
}
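As a concrete sketch of this idea (assuming 'Value' is an integer column; the column name and the wiring below are illustrative, not from the original post), the per-partition logic can be factored into a pure function over the partition's iterator, which also makes it testable without a Spark cluster:

```scala
// Counts how many values in one partition's iterator are divisible by 2 or by 5.
// Pure function over an Iterator, so it runs and tests without Spark.
def countDivisible(values: Iterator[Int]): Int =
  values.count(v => v % 2 == 0 || v % 5 == 0)

// Hypothetical wiring into Spark 1.6+ (sketch, not executed here):
//   df.repartition($"A", $"B")
//     .rdd
//     .map(_.getAs[Int]("Value"))
//     .mapPartitions(it => Iterator(countDivisible(it)))
//     .collect()   // one count per partition

// Simulate the question's four (A, B) partitions locally:
val partitions = Seq(
  ((1, 1), Seq(40)),
  ((1, 2), Seq(3, 5)),
  ((2, 1), Seq(6)),
  ((2, 2), Seq(10))
)
partitions.foreach { case (key, vals) =>
  println(s"$key -> ${countDivisible(vals.iterator)}")
}
```

Because mapPartitions calls the function once per partition, the counter lives only for the lifetime of that call, so no global bookkeeping per partition is required.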
