
Spark RDD/DF as a member variable of a class. Does it affect performance?

I would like to ask one theoretical question.

Imagine we have our main Object as:

val pathToFile: String = "/some/path/to/file.csv" 
val rddLoader: RddLoader = new RddLoader(pathToFile)
val rdd = rddLoader.load()
val transformer = new Transformer(rdd)
transformer.complexTransformations1(someOtherRdd)
transformer.complexTransformations2(yetAnotherRdd)

And our Transformer defined as (pseudo code):

class Transformer(rdd: RDD[Struct]) {

  val rddToTransform = rdd.someTransformation

  def complexTransformations1(anotherRdd: RDD[Struct]) = {
     rddToTransform.complexTransformationsInvolvingAnotherRdd
  }

  def complexTransformations2(anotherRdd: RDD[Struct]) = {
     rddToTransform.complexTransformations2InvolvingAnotherRdd
  }
}

Will the fact that rddToTransform is a member of the class, and thus of an instance of the class, impact performance? I guess the whole class will be serialized. But will this cause rddToTransform to be serialized for each partition, i.e. multiple times?

Will the version below be better in terms of performance, serialization overhead, etc.? In it we use an object, and our RDD is not a member of a class but is simply passed as a parameter to each method.

val pathToFile: String = "/some/path/to/file.csv" 
val rddLoader: RddLoader = new RddLoader(pathToFile)
val rdd = rddLoader.load()
val transformer = Transformer
transformer.complexTransformations1(rdd, someOtherRdd)
transformer.complexTransformations2(rdd, yetAnotherRdd)


object Transformer {

  def complexTransformations1(rdd: RDD[Struct], anotherRdd: RDD[Struct]) = {
     rdd.complexTransformationsInvolvingAnotherRdd
  }

  def complexTransformations2(rdd: RDD[Struct], anotherRdd: RDD[Struct]) = {
     rdd.complexTransformations2InvolvingAnotherRdd
  }
}

I can give an example with broadcast variables. I understand how they work; I am just wondering whether what is explained below would also apply to RDDs, and whether we need to avoid using RDDs as in the first example (as a member of a class).

Let's say we have a large dataset, with 420 partitions and a cluster of 8 executor nodes. In an operation like:

val referenceData = Map(...)
val filtered = rdd.filter(elem => referenceData.getOrElse(elem, 0) > 10)

The referenceData object will be serialized 420 times, i.e. as many times as there are tasks required to execute the transformation.

Instead, a broadcast variable:

val referenceDataBC = sparkContext.broadcast(Map(...))
val filtered = rdd.filter(elem => referenceDataBC.value.getOrElse(elem, 0) > 10)

will be sent once to each executor, or 8 times in total, saving a lot of network and CPU by reducing the serialization overhead.

The short answer is yes: in the normal case broadcast variables are a better memory optimisation, but there are cases where they are not usable.

To understand this better:

Apache Spark has two types of abstractions. The main abstraction Spark provides is the Resilient Distributed Dataset (RDD); the other is shared variables.

Shared variables: shared variables are variables that need to be used by many functions and methods in parallel; they can be used in parallel operations.

Spark splits the job into the smallest possible units of work, closures, which run on different nodes, each with its own copy of all the variables used by the job. Any changes made to these copies are not reflected in the driver program; to overcome this limitation, Spark provides two special types of shared variables: broadcast variables and accumulators.
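As a minimal sketch of this limitation (assuming a SparkContext named sc; the data is made up for illustration), a plain variable captured in a closure is copied into each task, so updates made on the executors never reach the driver:

// Each task gets its own copy of `counter`; increments happen on the
// executors and are never sent back to the driver.
var counter = 0
val data = sc.parallelize(1 to 100)
data.foreach(x => counter += x)

// Typically still 0 on a cluster; the behaviour is undefined in general.
println(counter)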

Broadcast variables: used to cache a value in memory on all nodes. Only one read-only instance of the variable is shared by all computations throughout the cluster. Spark sends the broadcast variable to each node concerned by the related tasks, and each node caches it locally in serialised form. Before executing each of the planned tasks, the node then retrieves the value from its local cache instead of fetching it from the driver. Broadcast variables are:

immutable (unchangeable), distributed (i.e. broadcast to the cluster), and must fit in memory.
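A small sketch of the lifecycle (assuming a SparkContext named sc and an RDD of string keys named keysRdd, both made up for illustration): the value is shipped to each executor once, tasks read it through .value, and the cached copies can be released when no longer needed because they occupy executor memory.

// Broadcast a small lookup table once per executor.
val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))

// Tasks read the cached copy locally via .value.
val enriched = keysRdd.map(key => (key, lookup.value.getOrElse(key, 0)))
enriched.count()

// Release the cached copies once the value is no longer needed.
lookup.unpersist()   // drop the copies cached on the executors
lookup.destroy()     // remove all data related to the broadcast variable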

Accumulators: as the name suggests, the main role of accumulators is to accumulate values. Accumulators are variables used to implement counters and sums. Out of the box, Spark provides accumulators of numeric types only, and the user can create named or unnamed accumulators. Unlike broadcast variables, accumulators are writable; however, the written values can only be read in the driver program. That is why accumulators work well as data aggregators.
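A minimal accumulator sketch (again assuming a SparkContext named sc; the data is made up for illustration): a named LongAccumulator counts empty lines while the RDD is processed, and its value is read back on the driver.

// Create a named numeric accumulator on the driver.
val emptyLines = sc.longAccumulator("emptyLines")

val lines = sc.parallelize(Seq("a", "", "b", ""))
lines.foreach { line =>
  if (line.isEmpty) emptyLines.add(1)   // written on the executors
}

println(emptyLines.value)               // readable only on the driver; prints 2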
