
SPARK: Maintaining different variables for different partitions?

Let's say I have some data like:

A  B  Value
1  1  40
1  2  3
1  2  5
2  1  6
2  2  10

stored in a DataFrame (say `df`), and I have partitioned it on both A and B:

df.repartition($"A",$"B")

Now, suppose we need to count, separately within each partition, the number of values divisible by 2 or by 5. It would be unreasonable to maintain as many variables as there are partitions. What is the best way to go about this?

(Kindly offer solutions that are applicable in Spark 1.6+)

You can use the `.mapPartitions` transformation to perform a partition-specific calculation. For example:

rdd.mapPartitions { iter =>
  var s = 0            // per-partition state, local to this partition
  iter.map { elem =>
    // operate on each element of the partition
    elem
  }
}
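Applied to the question's task, the function passed to `mapPartitions` can consume a partition's iterator and emit a single count, so no per-partition variables need to be tracked outside Spark. Below is a minimal sketch; `countDivisible` is a hypothetical helper name, and the check at the bottom runs it on a plain iterator standing in for one partition:

```scala
// Hypothetical helper: the logic that would run inside each partition.
// It consumes the partition's iterator and emits exactly one count.
def countDivisible(values: Iterator[Int]): Iterator[Int] = {
  // count the values divisible by 2 or by 5 in this partition
  Iterator.single(values.count(v => v % 2 == 0 || v % 5 == 0))
}

// Plain-Scala check on one simulated partition: 40 and 5 qualify, 3 does not.
val perPartition = countDivisible(Iterator(40, 3, 5)).next()
```

In Spark you would apply it to the RDD underlying the DataFrame, along the lines of `df.select("Value").rdd.mapPartitions(rows => countDivisible(rows.map(_.getInt(0))))`; collecting the result then yields one count per partition without any shared state.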

