
Iteratively running queries on Apache Spark

I've been trying to execute 10,000 queries over a relatively large dataset (11M records). More specifically, I am trying to transform an RDD using filter with some predicate and then compute how many records conform to that filter by applying the count action.

I am running Apache Spark on my local machine, which has 16GB of memory and an 8-core CPU. I have set --driver-memory to 10G in order to cache the RDD in memory.

However, because I have to repeat this operation 10,000 times, it takes unusually long to finish. I am also attaching my code in the hope that it makes things clearer.

Loading the queries and the DataFrame I am going to query against:

//load normalized dimensions
val df = spark.read.parquet("/normalized.parquet").cache()
//load query ranges
val rdd = spark.sparkContext.textFile("part-00000")

Parallelizing the execution of queries

Here, my queries are collected in a list and executed in parallel using par. I then collect the parameters that each query needs in order to filter the Dataset. The isWithin function tests whether the Vector contained in my dataset lies within the bounds given by the query.

After filtering the dataset, I execute count to get the number of records in the filtered dataset, and then build a string reporting that count.

val results = queries.par.map(q => {
  // the last element of each query is the volume, the rest are the dimension bounds
  val volume = q(q.length - 1)
  val dimensions = q.slice(0, q.length - 1)
  // filter the cached DataFrame on the predicate and count the matching records
  val count = df.filter(row => {
    val v = row.getAs[DenseVector]("scaledOpen")
    isWithin(volume, v, dimensions)
  }).count
  q.mkString(",") + "," + count
})
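isWithin itself is not shown here; purely for illustration, a predicate of that shape could look like the sketch below. The parameter types and the per-coordinate tolerance check are assumptions on my part, not the actual implementation.

import org.apache.spark.mllib.linalg.{Vector => MLlibVector}

// Hypothetical sketch only; the real isWithin is not included in the question.
// Assumes `dimensions` is the query's reference point and `volume` a per-coordinate tolerance.
def isWithin(volume: Double, v: MLlibVector, dimensions: Array[Double]): Boolean = {
  require(v.size == dimensions.length, "vector and query must have the same dimensionality")
  dimensions.indices.forall(i => math.abs(v(i) - dimensions(i)) <= volume)
}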

Now, what I have in mind is that this task is inherently hard given the size of my dataset and the fact that I am running it on a single machine. I know this could be much faster on a cluster running Spark or by utilizing an index. However, I am wondering if there is a way to make it faster as it is.

Just because you parallelize access to a local collection doesn't mean that anything is executed in parallel. The number of jobs that can be executed concurrently is limited by the cluster resources, not by the driver code.

At the same time, Spark is designed for high-latency batch jobs. If the number of jobs goes into the tens of thousands, you just cannot expect things to be fast.
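As an aside, if you do keep the par approach, you can at least make the driver-side parallelism explicit. This is only a sketch, assuming Scala 2.12 and that queries is a local collection; it bounds how many jobs are submitted at once, while actual execution is still limited by the cluster resources:

import java.util.concurrent.ForkJoinPool
import scala.collection.parallel.ForkJoinTaskSupport

// Cap concurrent job submission from the driver at 8 (illustrative value);
// this does not add execution capacity, it only controls submission.
val parQueries = queries.par
parQueries.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(8))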

One thing you can try is to push the filters down into a single job. First, convert the DataFrame to an RDD:

import org.apache.spark.mllib.linalg.{Vector => MLlibVector}
import org.apache.spark.rdd.RDD

// extract the feature vectors once, as dense MLlib vectors
val vectors: RDD[org.apache.spark.mllib.linalg.DenseVector] = df.rdd.map(
  _.getAs[MLlibVector]("scaledOpen").toDense
)

map vectors to {0, 1} indicators:

import breeze.linalg.DenseVector

// It is not clear what the type of the queries is
type Q = ???
val queries: Seq[Q] = ???

val inds: RDD[breeze.linalg.DenseVector[Long]] = vectors.map(v => {
  //  Create {0, 1} indicator vector
  DenseVector(queries.map(q => {
    // Define as before
    val volume = ???
    val dimensions = ???

    // Output 0 or 1 for each q
    if (isWithin(volume, v, dimensions)) 1L else 0L
  }): _*)
})

aggregate partial results:

val counts: breeze.linalg.DenseVector[Long] = inds
  .aggregate(DenseVector.zeros[Long](queries.size))(_ += _, _ += _)
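If the RDD has many partitions, the same zero value and operators can also be used with treeAggregate, which merges partial results in stages instead of sending every partition's vector straight to the driver. A minimal variant of the step above:

import breeze.linalg.DenseVector

// Same operators as aggregate, combined tree-wise (depth 2 by default)
val treeCounts: breeze.linalg.DenseVector[Long] = inds
  .treeAggregate(DenseVector.zeros[Long](queries.size))(_ += _, _ += _)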

Finally, prepare the final output:

queries.zip(counts.toArray).map {
  case (q, c) => s"""${q.mkString(",")},$c"""
}
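If that report needs to end up on disk, one straightforward follow-up is to write the strings back out through Spark. The report name and the output path below are only illustrative, not part of the original answer:

// Collect the per-query report lines produced by the previous step
val report: Seq[String] = queries.zip(counts.toArray).map {
  case (q, c) => s"""${q.mkString(",")},$c"""
}

// Write one line per query; the output path is just an example
spark.sparkContext.parallelize(report).saveAsTextFile("/query-counts")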
