
What is the difference between union and zipPartitions for Apache Spark RDDs?

I am attempting to union two RDDs that are already distributed across our cluster with hash partitioning on the key. I do not need to preserve any ordering or even the partitioning; I just want the union to be as fast as possible. In this example I actually DO want all records, not just the distinct ones, i.e. I want to keep multiplicity.

Here is what I would naively use:

val newRDD = tempRDD1.union(tempRDD2)

Here is what someone recommended to me as being faster, since it takes advantage of how the RDDs are already partitioned and distributed:

val newRDD = tempRDD1.zipPartitions(tempRDD2, preservesPartitioning = true)((iter, iter2) => iter ++ iter2)

Which of these is faster? And are the results completely consistent, member-wise?

I ask this because up until now I thought these methods were equivalent, but when I cranked up the scale of my data and the number of partitions, executors, memory, and so on, I got strange results from the zipPartitions version: it no longer works correctly with a reduceByKey afterwards.

Perhaps my differences are due to my RDDs themselves, which have the form ((String, String), (String, Long, Long, Long, Long)), so maybe iter ++ iter2 is doing something other than unioning those values?
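For what it's worth, ++ on two Scala iterators is just lazy concatenation; it does not sort, de-duplicate, or inspect the tuple values. A plain Scala check (no Spark involved, toy values of my choosing):

val iter  = Iterator((("a", "b"), ("x", 1L, 2L, 3L, 4L)))
val iter2 = Iterator((("a", "b"), ("y", 5L, 6L, 7L, 8L)))
println((iter ++ iter2).toList)
// List(((a,b),(x,1,2,3,4)), ((a,b),(y,5,6,7,8))) -- both records kept, in order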

Is zipPartitions implicitly doing anything extra, like a comparison sort, or re-hashing things, or in general implementing the merge differently than union?

Will union and zipPartitions return different results if the RDDs contain non-distinct rows, multiple copies of keys, empty partitions, hash collisions among the keys, or any other such issues?
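To make the comparison concrete, here is a minimal spark-shell sketch (toy data and names of my choosing, not my real RDDs) of the two approaches side by side. Note that zipPartitions requires both RDDs to have the same number of partitions, while union does not:

import org.apache.spark.HashPartitioner

val p = new HashPartitioner(4)
val a = sc.parallelize(Seq(("k1", 1), ("k2", 2), ("k1", 3))).partitionBy(p)
val b = sc.parallelize(Seq(("k1", 10), ("k3", 30))).partitionBy(p)

val viaUnion = a.union(b)
val viaZip   = a.zipPartitions(b, preservesPartitioning = true)((i1, i2) => i1 ++ i2)

// Same multiset of records either way, duplicates included:
println(viaUnion.collect().sorted.toList == viaZip.collect().sorted.toList)  // true

The lineage and the resulting partition counts can differ between the two, though, which is what the toDebugString output below shows.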

Yes, I could run tests myself (in fact, I've done so for the past 2 days!), so please don't reply just to ask whether I've tried such-and-such. I'm asking this question to better understand what is happening at the code level, under the covers. Was union written as a subcase of zipPartitions?

Later edit: adding in some examples with toDebugString results, as recommended by @Holden

val tempIntermediateRDD6 = tempIntermediateRDD1.
  zipPartitions(tempIntermediateRDD2, true)((iter, iter2) => iter ++ iter2).
  zipPartitions(tempIntermediateRDD5, true)((iter, iter2) => iter ++ iter2).
  partitionBy(partitioner).
  setName("tempIntermediateRDD6").
  persist(StorageLevel.MEMORY_AND_DISK_SER)

tempIntermediateRDD6.checkpoint()

println(tempIntermediateRDD6.toDebugString)

// (2568) tempIntermediateRDD6 ZippedPartitionsRDD2[169] at zipPartitions at mycode.scala:3203 [Disk Memory Serialized 1x Replicated]
//   |    ZippedPartitionsRDD2[168] at zipPartitions at mycode.scala:3202 [Disk Memory Serialized 1x Replicated]
//   |    tempIntermediateRDD1 ShuffledRDD[104] at partitionBy at mycode.scala:2824 [Disk Memory Serialized 1x Replicated]
//   |        CachedPartitions: 2568; MemorySize: 200.0 B; TachyonSize: 0.0 B; DiskSize: 0.0 B
//   |    CheckpointRDD[105] at count at mycode.scala:2836 [Disk Memory Serialized 1x Replicated]
//   |    tempIntermediateRDD2 ShuffledRDD[116] at partitionBy at mycode.scala:2900 [Disk Memory Serialized 1x Replicated]
//   |    CheckpointRDD[117] at count at mycode.scala:2912 [Disk Memory Serialized 1x Replicated]
//   |    tempIntermediateRDD5 MapPartitionsRDD[163] at distinct at mycode.scala:3102 [Disk Memory Serialized 1x Replicated]
//   |        CachedPartitions: 2568; MemorySize: 550.0 B; TachyonSize: 0.0 B; DiskSize: 0.0 B
//   |    CheckpointRDD[164] at count at mycode.scala:3113 [Disk Memory Serialized 1x Replicated]

versus:

val tempIntermediateRDD6 = tempIntermediateRDD1.
  union(tempIntermediateRDD2).
  union(tempIntermediateRDD5).
  partitionBy(partitioner).
  setName("tempIntermediateRDD6").
  persist(StorageLevel.MEMORY_AND_DISK_SER)

tempIntermediateRDD6.checkpoint()

println(tempIntermediateRDD6.toDebugString)

// (2568) tempIntermediateRDD6 ShuffledRDD[170] at partitionBy at mycode.scala:3208 [Disk Memory Serialized 1x Replicated]
//   +-(5136) UnionRDD[169] at union at mycode.scala:3207 [Disk Memory Serialized 1x Replicated]
//       |    PartitionerAwareUnionRDD[168] at union at mycode.scala:3206 [Disk Memory Serialized 1x Replicated]
//       |    tempIntermediateRDD1 ShuffledRDD[104] at partitionBy at mycode.scala:2824 [Disk Memory Serialized 1x Replicated]
//       |        CachedPartitions: 2568; MemorySize: 200.0 B; TachyonSize: 0.0 B; DiskSize: 0.0 B
//       |    CheckpointRDD[105] at count at mycode.scala:2836 [Disk Memory Serialized 1x Replicated]
//       |    tempIntermediateRDD2 ShuffledRDD[116] at partitionBy at mycode.scala:2900 [Disk Memory Serialized 1x Replicated]
//       |    CheckpointRDD[117] at count at mycode.scala:2912 [Disk Memory Serialized 1x Replicated]
//       |    tempIntermediateRDD5 MapPartitionsRDD[163] at distinct at mycode.scala:3102 [Disk Memory Serialized 1x Replicated]
//       |        CachedPartitions: 2568; MemorySize: 550.0 B; TachyonSize: 0.0 B; DiskSize: 0.0 B
//       |    CheckpointRDD[164] at count at mycode.scala:3113 [Disk Memory Serialized 1x Replicated]

Union returns a specialized UnionRDD; we can see how it was written by looking at UnionRDD.scala in the Spark project. There we can see that UnionRDD builds its partitions with this block of code:

  override def getPartitions: Array[Partition] = {
    // One UnionPartition per partition of every parent RDD,
    // laid out back to back in parent order.
    val array = new Array[Partition](rdds.map(_.partitions.length).sum)
    var pos = 0
    for ((rdd, rddIndex) <- rdds.zipWithIndex; split <- rdd.partitions) {
      array(pos) = new UnionPartition(pos, rdd, rddIndex, split.index)
      pos += 1
    }
    array
  }
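So getPartitions just lays every parent's partitions out back to back; no records are compared, hashed, or sorted. The matching compute method in the same file (reproduced roughly from memory here, so check UnionRDD.scala for your Spark version) simply forwards each UnionPartition to the iterator of its parent partition:

  override def compute(s: Partition, context: TaskContext): Iterator[T] = {
    // Delegate straight to the parent RDD's iterator for this partition;
    // the union itself does no per-record work.
    val part = s.asInstanceOf[UnionPartition[T]]
    parent[T](part.parentRddIndex).iterator(part.parentPartition, context)
  }

Any shuffle you see in the union version of your job comes from the partitionBy that follows the union, not from the union itself.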

If you're curious what the underlying computation looks like for an RDD, I'd recommend calling the toDebugString function on the resulting RDD. You can then see what the dependency DAG looks like.
