I'm trying to compare the performance of Spark queries on a dataset backed by Parquet files versus a cached dataset.
Surprisingly, queries on the Parquet dataset are faster than queries on the cached data. I see at least 2 reasons why they should not be:
I ran this small benchmark on a 300MB Parquet file (9M lines), timing only the queries, not the time to cache the data:
def benchmarkSum(ds: org.apache.spark.sql.DataFrame): Double = {
  val begin = System.nanoTime()
  // Run the same aggregation 1000 times.
  for (_ <- 1 to 1000) {
    ds.groupBy().sum("columnName").first()
  }
  // Elapsed time in seconds.
  (System.nanoTime() - begin) / 1000000000.0
}

val pqt = spark.read.parquet("myfile.parquet")
benchmarkSum(pqt) // 54s

val cached = pqt.cache()
cached.groupBy().sum("columnName").first() // One first call to trigger the caching before the benchmark.
benchmarkSum(cached) // 77s
The queries on Parquet took 54s while it took 77s on the cached dataset.
I am running this benchmark in a spark-shell with 8 cores and 10GB of memory.
So why is it slower to use cached data to sum my column? Am I doing something wrong?
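One way to check what Spark actually executes in each case is to print the storage level and the physical plan of the benchmarked query. The following is only a minimal sketch: the helper name `describeSumPlan` is hypothetical and not part of the original benchmark, and the exact plan output depends on the Spark version.

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical helper (not from the original post): prints how Spark
// plans to run the same aggregation that benchmarkSum times.
def describeSumPlan(ds: DataFrame, column: String): Unit = {
  // storageLevel is StorageLevel.NONE when the dataset is not marked for caching.
  println(s"storage level: ${ds.storageLevel}")
  // explain() prints the physical plan. For a cached dataset one would
  // expect an in-memory scan node here instead of a Parquet file scan.
  ds.groupBy().sum(column).explain()
}
```

Calling `describeSumPlan(pqt, "columnName")` and `describeSumPlan(cached, "columnName")` side by side should show whether the second query really reads from the cache.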
Try calling .cache on the first pqt statement.
So, this is what I did:
I changed your
ds.groupBy().sum("value").first()
to a simple DataFrame count, also performed 1000 times (see what the duplicate experts say):
df.count
I then ran the following in 2 separate runs (but without restarting the mini cluster):
// RUN 1
val pqt = spark.read.text("/FileStore/tables/TTT.txt")
benchmarkSum(pqt)
and
// RUN 2
val pqt = spark.read.text("/FileStore/tables/TTT.txt").cache
benchmarkSum(pqt)
Run 1 took 806 and 860 seconds on two consecutive executions.
Run 2 took 51 and 50 seconds on two consecutive executions.
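Both benchmarks in this thread time 1000 repetitions of a single action in one loop. A minimal sketch of a reusable timer for that pattern is shown here. The name `timeSeconds` and the `warmup` parameter are assumptions for illustration and do not appear in either post.

```scala
// Hypothetical helper (not from the original posts): runs `action` a few
// times untimed, then times `iterations` further runs and returns seconds.
def timeSeconds(iterations: Int, warmup: Int = 1)(action: => Any): Double = {
  // Untimed warm-up runs, e.g. to materialize a cache before measuring.
  for (_ <- 1 to warmup) action
  val begin = System.nanoTime()
  // Timed runs of the same action.
  for (_ <- 1 to iterations) action
  // Elapsed wall-clock time in seconds.
  (System.nanoTime() - begin) / 1e9
}
```

With this, the two variants could be compared as `timeSeconds(1000) { pqt.count }` before and after adding `.cache`, so the warm-up run absorbs the cost of filling the cache.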
So, this is a slightly different approach, with .cache applied up front, but I do not feel that explains it. I do observe that .cache gives a marked improvement, which differs from your scenario and outcome.
Not sure what to make of it, except that in my scenario things seem to work as expected. Could it be a bug in the Catalyst / Tungsten optimizations under the hood? I see posts about that every now and again.