I'm learning Spark and how its parallelism relates to the way an RDD is distributed across partitions. I have a 4-CPU machine, so I have 4 units of parallelism. I want to return the members of partition index 0, but I couldn't find a way to do that without forcing the RDD through toLocalIterator.
I'm used to Spark being quite terse. Is there a more concise way to filter an RDD by partition? The following two methods work, but they seem clumsy.
scala> val data = 1 to 20
data: scala.collection.immutable.Range.Inclusive = Range(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20)
scala> val distData = sc.parallelize(data)
distData: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[75] at parallelize at <console>:26
scala> distData.mapPartitionsWithIndex{
  (index,it) => {
    it.toList.map(x => if (index == 0) (x)).iterator
  }
}.toLocalIterator.toList.filterNot(
  _.isInstanceOf[Unit]
)
res107: List[AnyVal] = List(1, 2, 3, 4, 5)
scala> distData.mapPartitionsWithIndex{
  (index,it) => {
    it.toList.map(x => if (index == 0) (x)).iterator
  }
}.toLocalIterator.toList.filter(
  _ match{
    case x: Unit => false
    case x => true
  }
)
res108: List[AnyVal] = List(1, 2, 3, 4, 5)
distData.mapPartitionsWithIndex{ (index, it) =>
  if (index == 0) it else Array[Int]().iterator
}
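For index 0 this returns the partition's iterator unchanged, and for every other partition it returns an empty iterator, so there are no Unit placeholders to filter out afterwards.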
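To see the empty-iterator approach end to end, here is a minimal sketch that can be pasted into spark-shell. It assumes a local master with 4 cores and passes numSlices = 4 explicitly so the layout matches the question. The names layout, part0 and viaRunJob are illustrative only. The SparkContext.runJob call at the end is an alternative that was not part of the answer above: it should run the job on the listed partitions only, rather than visiting every partition and returning empty iterators for most of them.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session with 4 cores. Inside spark-shell,
// getOrCreate() returns the session that already exists.
val spark = SparkSession.builder()
  .master("local[4]")
  .appName("partition-filter-sketch")
  .getOrCreate()
val sc = spark.sparkContext

// Ask for 4 slices explicitly so the layout does not depend on
// the default parallelism of the machine.
val distData = sc.parallelize(1 to 20, numSlices = 4)

// glom() turns each partition into one array, which makes the
// partition layout visible after collect().
val layout = distData.glom().collect()

// Keep partition 0 and return an empty iterator for all others.
// The explicit [Int] pins the element type of the result RDD.
val part0 = distData.mapPartitionsWithIndex[Int](
  (index, it) => if (index == 0) it else Iterator.empty,
  preservesPartitioning = true
).collect()

// Alternative (not from the answer above): run the job on
// partition 0 only and take its single result.
val viaRunJob = sc.runJob(distData, (it: Iterator[Int]) => it.toArray, Seq(0)).head

// With 4 slices this matches the question's output: 1, 2, 3, 4, 5
println(part0.mkString(", "))
```

collect() replaces toLocalIterator.toList here because the filtered RDD is already small. Either works, and toLocalIterator remains the better choice if the selected partition might not fit comfortably alongside everything else on the driver.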