
SPARK: Maintaining different variables for different partitions?

Let's say I have some data like:

A B Value
1 1    40
1 2     3
1 2     5
2 1     6
2 2    10

in a DataFrame (say 'df'), and I have partitioned it on both A and B as:

df.repartition($"A",$"B")

Now, let's say we are supposed to count, separately for each partition, the number of values that are divisible by 2 or by 5. It would be unreasonable to maintain as many variables as there are partitions. What is the most efficient way to go about this?

(Kindly offer solutions that are applicable in Spark 1.6+)

You can use the .mapPartitions transformation to run a computation on each partition independently. For example:

rdd.mapPartitions { partition =>
  // One local counter per partition; no shared state needed,
  // because each invocation sees only this partition's elements.
  var count = 0
  partition.foreach { v =>
    if (v % 2 == 0 || v % 5 == 0) count += 1
  }
  Iterator(count)  // emit one result per partition
}
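As a concrete sketch of this idea (assuming 'Value' is an integer column; the column name and the wiring below are illustrative, not from the original post), the per-partition logic can be factored into a pure function over the partition's iterator, which also makes it testable without a Spark cluster:

```scala
// Counts how many values in one partition's iterator are divisible by 2 or by 5.
// Pure function over an Iterator, so it runs and tests without Spark.
def countDivisible(values: Iterator[Int]): Int =
  values.count(v => v % 2 == 0 || v % 5 == 0)

// Hypothetical wiring into Spark 1.6+ (sketch, not executed here):
//   df.repartition($"A", $"B")
//     .rdd
//     .map(_.getAs[Int]("Value"))
//     .mapPartitions(it => Iterator(countDivisible(it)))
//     .collect()   // one count per partition

// Simulate the question's four (A, B) partitions locally:
val partitions = Seq(
  ((1, 1), Seq(40)),
  ((1, 2), Seq(3, 5)),
  ((2, 1), Seq(6)),
  ((2, 2), Seq(10))
)
partitions.foreach { case (key, vals) =>
  println(s"$key -> ${countDivisible(vals.iterator)}")
}
```

Because mapPartitions calls the function once per partition, the counter lives only for the lifetime of that call, so no global bookkeeping per partition is required.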
