
Why is my query faster before caching my dataset in Spark?

I'm trying to compare the performance of Spark queries on a dataset backed by Parquet files and on the same dataset cached in memory.

Surprisingly, queries on the Parquet dataset are faster than queries on the cached data. I see at least two reasons why this should not be the case:

  • cached data is in memory while the Parquet file isn't (it's on my SSD; see the sketch after this list)
  • I'm expecting cached data to be in a format optimized for Spark queries
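
On the first bullet: whether the cached copy actually sits entirely in memory can be checked from the shell. A minimal sketch, assuming Spark 2.1+, where Dataset.cache() defaults to StorageLevel.MEMORY_AND_DISK (so partitions that don't fit in memory spill to disk) and storageLevel reports the level in effect:

    import org.apache.spark.storage.StorageLevel

    val df = spark.read.parquet("myfile.parquet")
    df.cache()                 // equivalent to df.persist(StorageLevel.MEMORY_AND_DISK)
    println(df.storageLevel)   // shows whether disk, memory, serialized form are in play
    // Use df.persist(StorageLevel.MEMORY_ONLY) instead of cache() to keep it memory-only.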

I've done this small benchmark on a 300 MB Parquet file (9M rows), timing only the query, not the time needed to cache the data:

def benchmarkSum(ds: org.apache.spark.sql.DataFrame): Double = {
  val begin = System.nanoTime()
  for (_ <- 1 to 1000) {
    ds.groupBy().sum("columnName").first()
  }
  (System.nanoTime() - begin) / 1000000000.0 // elapsed wall-clock time in seconds
}

val pqt = spark.read.parquet("myfile.parquet")
benchmarkSum(pqt) // 54s

val cached = pqt.cache()
cached.groupBy().sum("columnName").first() // One first call triggers the caching before the benchmark.
benchmarkSum(cached) // 77s
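
A side note on benchmark hygiene: the first() on the grouped sum should already scan every partition, but a common pattern is to materialize the cache explicitly and to clear it between runs. A minimal sketch (the count() call and the Storage tab check are one conventional way to do this, not part of the original benchmark):

    val cached = pqt.cache()
    cached.count()   // forces every partition to be cached before any timing starts
    // Verify in the Spark UI "Storage" tab that the dataset shows as 100% cached.
    // Between experiments, clear cached state so runs do not interfere:
    // cached.unpersist(blocking = true)
    // spark.catalog.clearCache()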

The queries on the Parquet dataset took 54 s, while they took 77 s on the cached dataset.
I am running this benchmark in a spark-shell with 8 cores and 10 GB of memory.

So why is it slower to use cached data to sum my column? Am I doing something wrong?

Try .cache on the first pqt statement.

So, this is what I did:

  1. Uploaded a 78MB text file to Databricks FileStore.
  2. Ran a modified benchmark on the standard Databricks Community Edition setup.
  3. Modified your

     ds.groupBy().sum("value").first()

     to a simple DataFrame count, also performed 1000 times (in line with what the experts on the duplicate questions say):

     df.count
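
For reference, the modified benchmark presumably looked something like this (same shape as the original function, with only the measured query changed):

     def benchmarkSum(ds: org.apache.spark.sql.DataFrame): Double = {
       val begin = System.nanoTime()
       for (_ <- 1 to 1000) {
         ds.count()   // a plain count instead of the grouped sum
       }
       (System.nanoTime() - begin) / 1000000000.0
     }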

I then ran the following as two separate runs (without restarting the mini cluster):

 // RUN 1
 val pqt = spark.read.text("/FileStore/tables/TTT.txt")
 benchmarkSum(pqt)

and

 // RUN 2
 val pqt = spark.read.text("/FileStore/tables/TTT.txt").cache
 benchmarkSum(pqt)

For Run 1 I got 806 and 860 seconds consecutively (ran 2x).

For Run 2 I got 51 and 50 seconds consecutively (ran 2x).

So, this is a slightly different approach, with the .cache put up front, but I do not feel that explains the difference. What I do observe is that .cache makes a marked improvement, which is the opposite of your scenario and outcomes.

I am not sure what to make of it, except that in my scenario things seem to work as suggested. Could it be a bug in the Catalyst / Tungsten under-the-hood optimization? I see posts on that now and again.
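
One way to dig further is to compare the physical plans, since the cached and uncached paths go through different scan operators. A minimal sketch (operator names like FileScan parquet and InMemoryTableScan are as printed by Spark 2.x and may differ in other versions):

    pqt.groupBy().sum("columnName").explain()
    // Physical plan should show: ... FileScan parquet ... (vectorized Parquet reader)

    cached.groupBy().sum("columnName").explain()
    // Physical plan should show: ... InMemoryTableScan ... (scan over the cached columnar blocks)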
