
How to write Spark Optimization to recalculate DataFrame when array column contains more values than a threshold?

  • Spark 2.1.0 with Scala 2.10.6

I've been looking for tutorials and help on extending Spark Catalyst optimization with custom rules. All the examples I have found are quite basic and usually use multiplication by 1 as the example logical rule. I wanted to ask whether it is possible to write a custom rule for the following case.
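For reference, the kind of rule those tutorials demonstrate looks roughly like the sketch below (a minimal example assuming Spark 2.x; the rule name SimplifyMultiplyByOne is made up, and registration goes through the experimental optimizer hook):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.expressions.{Literal, Multiply}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// the classic "multiply by 1" rule: rewrite x * 1 (or 1 * x) to x
object SimplifyMultiplyByOne extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
    case Multiply(left, Literal(1, _))  => left
    case Multiply(Literal(1, _), right) => right
  }
}

val spark = SparkSession.builder().getOrCreate()
// register the rule with the optimizer via the experimental hook
spark.experimental.extraOptimizations = Seq(SimplifyMultiplyByOne)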

I have a dataframe with an array column (column _2). The array contains N values in each row. I need to recalculate this dataframe with a new value M so that the array contains M values. The optimization I want to apply: if N > M, the strategy should just return the dataframe with the top M values instead of recalculating the entire dataframe; otherwise it should call the function that computes the dataframe from scratch. So basically:

def rule(distance: DataFrame, M: Int): DataFrame = {
  distance.registerTempTable("tab1")
  // size of the array column _2 (the same N in every row)
  val N = sqlContext.sql("select size(_2) from tab1").first().getInt(0)
  if (N > M) {
    // query that selects only the top M values from column _2 to build distanceNew
    distanceNew
  } else {
    // call the function that calculates distance from scratch (a time-consuming process)
    getNNeighbors(distance, M)
  }
}

Is it possible to write such a custom Rule in Spark Catalyst, and if so, can I get some guidance on how to do it? Or is there another way to define custom run-time optimization rules based on pattern matching in Spark SQL that I could use?

I have already written a function that does this, but my aim is to express it in some optimization/pattern-matching API as a proof of concept that generic rules can be written to optimize algorithms.

If I understood your question correctly, you want to recreate the array in your dataframe when the condition matches, right? If that is the case, you don't need to recreate the dataframe; just use a udf function.
The following is not a complete solution, but it should be helpful.
Define the udf function, assuming the array holds double values:

import org.apache.spark.sql.functions.{lit, udf}

// note: an array column reaches a udf as Seq (WrappedArray), not Array
def testUdf = udf((value: Seq[Double], M: Int) => {
  val N = value.size
  if (N > M) value.sortBy(-_).take(M)  // return the new array with only the top M values
  else value                           // return the array you need (here: unchanged)
})

and call it with the following

dataframe.withColumn("column 2 name", testUdf(dataframe("column 2 name")))

I don't really think your case has anything to do with Spark SQL's optimizer rules (or batches thereof): here the rule would have to look at the content of the (business) data, whereas optimizer rules work on the structure of the query or the Dataset. It would not be an optimization rule but a Dataset transformation.

I don't think it merits a new optimization rule.

With that said, why don't you do the following:

import org.apache.spark.sql.functions.size
import spark.implicits._

// check the length of the array column
val theArrayColumn = ...
val arraySize = distance.select(size(theArrayColumn) as "size").orderBy("size").as[Int].head
val inputDF = if (arraySize > M) {
  recalculateDistance()
} else {
  // we're fine
  distance
}
getNNeighbors(inputDF, M)

If you described getNNeighbors in more detail, this could get even simpler.

size is a function in the functions object (org.apache.spark.sql.functions).
