Spark: Collect/parallelize on RDD is “faster” than “doing nothing” on RDD

I'm trying to improve my understanding of "collect" in my Spark app code, and I'm dealing with this code:

  val triple = logData.map(x => x.split('@'))
                  .map(x => (x(1),x(0),x(2)))
                  .collect()
                  .sortBy(x => (x._1,x._2))
  val idx = sc.parallelize(triple)

Basically I'm creating a (String, String, String) RDD with an unnecessary (imho) collect/parallelize step (200k elements in the original RDD).

The Spark guide says: "Collect: Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data."

BTW: is 200k sufficiently small?

I feel that this code should be "lighter" (with no collect-parallelize):

  val triple = logData.map(x => x.split('@'))
                  .map(x => (x(1),x(0),x(2)))
                  .sortBy(x => (x._1,x._2))
  val idx = triple

But after running the same app many times (locally, not distributed), I always get faster times with the first code, which in my opinion is doing extra work (first collect, then parallelize).

The entire app (not just this code snippet) takes on average 48 seconds in the first case, and at least 52 seconds in the second case.

How is this possible?

Thanks in advance

I think it is because the dataset is too small. In the second case you pay the scheduling cost of the shuffle that RDD.sortBy triggers, whereas sorting after collect() runs as a plain in-memory sort on the driver, which can be faster locally. When your dataset grows, it may not even be possible to collect it into the driver.
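To see why the first version can win locally: after .collect() the data is an ordinary Scala Array on the driver, so the subsequent .sortBy is just Scala's in-memory array sort, with no Spark shuffle involved. A minimal sketch of that driver-side step, using hypothetical sample data in the question's "field@field@field" format (no SparkContext needed):

```scala
// Hypothetical log lines in the same '@'-delimited format as the question.
val logData = Array("b@2@x", "a@1@y", "a@0@z")

// Same transformation as the question, but on a plain Scala collection.
// This is effectively what runs on the driver after collect(): a local,
// single-threaded sort — no shuffle, no task scheduling.
val triple = logData.map(_.split('@'))
  .map(x => (x(1), x(0), x(2)))
  .sortBy(x => (x._1, x._2))

println(triple.mkString(", "))
```

On a distributed dataset, RDD.sortBy does the same logical sort but must repartition and shuffle data across executors, which is where the extra seconds go in a small local run.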

