
Scala Spark/Databricks: .cache() not preventing re-calculation

This involves some complexity, and I may be unclear on some basics. Here goes:

As I understand it, Spark has "transformations" and "actions". Transformations lazily build up a description of what you want to do, and actions make it happen. This can improve performance (allowing optimized plans), but it can also lead to duplicated effort if you run multiple actions on a single dataframe, causing its transformations to execute repeatedly. To avoid this, .cache() tells Spark to actually "save its work", so the dataframe you call it on shouldn't keep being re-computed.
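For instance, here is a rough sketch of my mental model (df and its columns are made up for illustration):

import org.apache.spark.sql.functions.col

// Illustrative only: df and its columns are hypothetical.
val transformed = df.filter(col("value") > 0).groupBy(col("key")).count() // lazy: builds a plan, nothing runs yet

transformed.count() // action: executes the whole plan
transformed.show()  // another action: without cache(), the plan runs again from scratch

val cached = transformed.cache() // cache() is itself lazy; it only marks the data for caching
cached.count()                   // the first action materializes and populates the cache
cached.show()                    // should now be served from memory, not re-computed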

My problem is that it doesn't seem to be doing that. I have a function "foo" that does a lot of computation to produce a (very small) dataframe. foo runs quickly, and I can display the result. I have another function "bar" that runs a bunch of actions on a dataframe. bar runs quickly on the (large) original input, but very slowly on the output of foo, even when that output is cached and coalesced. I can also "force" the caching by writing the output of foo to disk and then re-reading it, at which point bar runs quickly:

display(bar(bigDF)) //Fast!

val profile = foo(bigDF).coalesce(1).cache()
display(profile) //Also fast! (and profile only has 2 rows, ~80 columns)

display(bar(profile)) //Slow!

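// Work-around: round-trip through CSV on disk, so Spark treats it as fresh input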
profile
  .write.format("com.databricks.spark.csv")
  .option("header", "true")
  .save("filename.csv")
val dumb = spark.read.format("csv").option("header", "true").load("filename.csv")
display(bar(dumb)) //Fast again

To me, this says that .cache() isn't working the way I think it does: the slow call keeps re-running the transformations in foo, unless I write the result to disk and force Spark to "forget" its history. Can someone explain what I'm missing?

cache does what you expect; it seems that something else strange is happening here.

I suspect that the coalesce(1) is the problem. Try leaving it out and test whether it runs faster; it may be destroying parallelism for bar. A sketch of that follows below.
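Something like this would be worth a try (just a sketch, reusing foo, bar, and bigDF from your question):

// Sketch: cache without collapsing everything into one partition.
val profile = foo(bigDF).cache()
profile.count()       // run an action so the cache is actually materialized
display(bar(profile)) // bar now runs over the cached data with its parallelism intact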

If nothing helps, try checkpoint instead of cache. It could be that the query plan is very long and complex; checkpoint would truncate it (it writes to disk).
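Roughly like this (the checkpoint directory is just an example path):

// Sketch: checkpoint materializes the data to the checkpoint dir and truncates the lineage.
spark.sparkContext.setCheckpointDir("/tmp/checkpoints") // example path, set once per session
val profile = foo(bigDF).checkpoint() // eager by default: writes to disk and cuts the plan
display(bar(profile))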

For further analysis, you would need to go into the Spark UI and look at the jobs.
