
Spark RDD equivalent to Scala collections partition

This is a minor issue with one of my Spark jobs which doesn't seem to cause any problems -- yet it annoys me every time I see it and fail to come up with a better solution.

Say I have a Scala collection like this:

import scala.util.Try
val myStuff = List(Try(2/2), Try(2/0))

I can partition this list into successes and failures with partition:

val (successes, failures) =  myStuff.partition(_.isSuccess)

Which is nice. The implementation of partition only traverses the source collection once to build the two new collections. However, using Spark, the best equivalent I have been able to devise is this:

val myStuff: RDD[Try[???]] = sourceRDD.map(someOperationThatMayFail)
val successes: RDD[???] = myStuff.collect { case Success(v) => v }
val failures: RDD[Throwable] = myStuff.collect { case Failure(ex) => ex }

Aside from the difference of unpacking the Try (which is fine), this also requires traversing the data twice, which is annoying.

Is there any better Spark alternative that can split an RDD without multiple traversals? i.e. something with a signature like this, where partition has the behaviour of the Scala collections partition rather than RDD partitioning:

val (successes: RDD[Try[???]], failures: RDD[Try[???]]) = myStuff.partition(_.isSuccess)
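
The closest I have come to that signature is to wrap caching plus two partial-function collects in a small helper. This is only a sketch (partitionTries is just a hypothetical name): it still makes two passes over the cached data, it merely avoids re-running someOperationThatMayFail.

import scala.util.{Try, Success, Failure}
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Sketch: persist the source so the upstream map runs only once, then build
// each side with collect over a partial function. Both resulting RDDs still
// scan the cached records when an action runs on them.
def partitionTries[T: ClassTag](rdd: RDD[Try[T]]): (RDD[T], RDD[Throwable]) = {
  val cached = rdd.persist(StorageLevel.MEMORY_AND_DISK)
  val successes = cached.collect { case Success(v) => v }
  val failures  = cached.collect { case Failure(ex) => ex }
  (successes, failures)
}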

For reference, I previously used something like the below to solve this. The potentially failing operation is deserializing some data from a binary format, and the failures have become interesting enough that they need to be processed and saved as an RDD rather than just logged.

def someOperationThatMayFail(data: Array[Byte]): Option[MyDataType] = {
  try {
    Some(deserialize(data))
  } catch {
    case e: MyDesrializationError =>
      logger.error(e)
      None
  }
}
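
A function like this is typically wired in with flatMap, so the None results (and with them the failures) are simply dropped from the output RDD and only survive in the logs; a minimal usage sketch:

// flatMap keeps only the Some values, discarding records that failed to deserialize.
val parsed: RDD[MyDataType] = sourceRDD.flatMap(bytes => someOperationThatMayFail(bytes).toSeq)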

There might be other solutions, but here you go:

Setup:

import scala.util._
val myStuff = List(Try(2/2), Try(2/0))
val myStuffInSpark = sc.parallelize(myStuff)

Execution:

val myStuffInSparkPartitioned = myStuffInSpark.aggregate((List[Try[Int]](), List[Try[Int]]()))(
  (accum, curr) => if (curr.isSuccess) (curr :: accum._1, accum._2) else (accum._1, curr :: accum._2),
  (first, second) => (first._1 ++ second._1, first._2 ++ second._2))
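
Note that aggregate materializes its result on the driver, so what comes back here is a single tuple of two ordinary Scala Lists rather than two RDDs; unpacking it looks like this:

// Driver-side result: only suitable when both sides fit in driver memory.
val (successes, failures) = myStuffInSparkPartitioned
// successes == List(Success(1))
// failures  == List(Failure(java.lang.ArithmeticException: / by zero))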

Let me know if you need an explanation
