
Concurrent operations on Spark DataFrame

I need to run several different filter operations on a DataFrame, count each result, and then sum the individual counts. I use Scala Futures for concurrent execution. Here is the code:

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

val f1 = Future { myDF.filter("pmod(idx, 8) = 1").count }
val f2 = Future { myDF.filter("pmod(idx, 8) = 2").count }
val f3 = Future { myDF.filter("pmod(idx, 8) = 3").count }

val future = for { c1 <- f1; c2 <- f2; c3 <- f3 } yield {
  c1 + c2 + c3
}

val summ = Await.result(future, 180.seconds)

Each filter/count operation takes about 7 seconds on its own. However, after running it many times, the concurrent execution always takes about 35 seconds in total, instead of the roughly 7 seconds I expected. I have been puzzled by this behavior for quite some time, but cannot figure it out.

I have a cluster of 3 machines: one master node and two worker nodes, each with 128G of memory and 32 cores. The data is about 3G in size. I noticed that during concurrent execution, one worker node shows 20 seconds of GC time. I have tuned GC so that an individual filter/count operation has almost no GC time. I am not sure why GC kicks in whenever I run the 3 Futures concurrently, and whether that is what makes the concurrent execution take longer.

Does anyone have experience with this issue?

Jobs are scheduled across your cluster sequentially, because each job in your script is a node in a DAG of jobs that defines the precedence relations between the data they manipulate, and any successful execution of your whole script must respect that precedence.

This rule applies even when there is no real precedence relation between your jobs (they all depend only on the same input, myDF). Your use of Futures only means your jobs are submitted to the scheduler near-simultaneously, not that they end up being scheduled simultaneously.
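That said, if you really need jobs submitted from different threads to share cluster resources, Spark's fair scheduling mode is worth knowing about. A minimal sketch, assuming you control the SparkSession configuration (the app name and pool name below are placeholders, and pool weights/minimum shares would come from a fairscheduler.xml file):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("concurrent-counts")                 // placeholder name
  .config("spark.scheduler.mode", "FAIR")       // default is FIFO
  .getOrCreate()

// Jobs submitted from this thread are assigned to the named pool.
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "countsPool")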

If you want parallelism, you should write it within a job, with something like:

import org.apache.spark.sql.functions.expr

myDF
  .filter("pmod(idx, 8) BETWEEN 1 AND 3")
  .groupBy(expr("pmod(idx, 8)"))   // groupBy needs a Column here, not a column-name string
  .count()
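That gives one row per remainder. If you still want the single grand total from the question, you can fold the sum into the same job. A minimal sketch, relying on the default "count" column that groupBy().count() produces:

import org.apache.spark.sql.functions.{expr, sum}

val total = myDF
  .filter("pmod(idx, 8) BETWEEN 1 AND 3")
  .groupBy(expr("pmod(idx, 8)"))
  .count()              // one row per remainder, in a column named "count"
  .agg(sum("count"))    // add the per-group counts inside the same job
  .first()
  .getLong(0)

(Since the three filters are disjoint, this total also equals a plain count of the filtered DataFrame; the grouped form is useful when you want the per-group breakdown as well.)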

And yes, you should cache myDF.
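A minimal sketch of that, with an eager count to materialize the cache before the filters run (the count() call is just one common way to force materialization):

myDF.cache()   // mark the DataFrame for in-memory caching
myDF.count()   // run one job to materialize it, so the filter/count jobs read from the cache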
