
How persist works on Derived DataFrame in Scala and its performance impact

Could you please explain the effect of persisting and unpersisting a DataFrame in Scala, using the example below? In particular, what is the effect of persist/unpersist on derived DataFrames? In the example, I unpersist dcRawAll because it is no longer used. However, I have read that a DataFrame should not be unpersisted until all actions on the DataFrames derived from it have completed, since the cache is otherwise deleted (or never created). (Assume each DataFrame has a couple more operations performed on it before it is unpersisted.)

Could you also explain the performance impact of the code below, and what can be done to optimize it?

Thanks in advance for the help.

    val dcRawAll = dataframe.select("C1","C2","C3","C4")   //dataframe is persisted
    dcRawAll.persist()

    val statsdcRawAll = dcRawAll.count()

    val dc = dcRawAll.where(col("c1").isNotNull)

    dc.persist()
    dcRawAll.unpersist(false)

    val statsdc = dc.count()

    val dcclean = dc.where(col("c2") === "SomeValue")
    dcclean.persist()
    dc.unpersist()

As written, your code gets almost no benefit from caching. Keep in mind that .persist() is lazy: it only marks the DataFrame to be cached, and nothing is actually stored until an action is run on that DataFrame.

.persist() returns the DataFrame it was called on, so you can chain it onto the expression that builds the DataFrame instead of calling it on a separate line. In Spark both forms mark the same DataFrame, so this is a readability choice rather than a fix. The real problem is where your unpersist() calls sit relative to the actions. Below I've annotated your code to explain in more detail what is likely happening during execution.

    import org.apache.spark.sql.functions.col

    // dcRawAll will contain a DataFrame that will be cached after its next action
    val dcRawAll = dataframe.select("C1","C2","C3","C4").persist()

    // after this line, dcRawAll has been calculated and then cached
    val statsdcRawAll = dcRawAll.count()

    // dc will contain a DataFrame that will be cached after its next action
    val dc = dcRawAll.where(col("c1").isNotNull).persist()

    // at this point, you've removed the dcRawAll cache without ever having
    // read from it, since no action has been performed on dc yet
    // if you want to make use of this cache, move the unpersist _after_ the
    // dc.count()
    dcRawAll.unpersist(false)

    // dcRawAll is recalculated from scratch, then dc is calculated from that
    // and then cached
    val statsdc = dc.count()

    // dcclean will contain a DataFrame that will be cached after its next action
    val dcclean = dc.where(col("c2") === "SomeValue").persist()

    // at this point, you've removed the dc cache without ever having read from it
    // if you perform a dcclean.count() before this, it will utilize the dc cache
    // and populate the cache for dcclean, to be used by later dcclean actions
    dc.unpersist()
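Applying the reordering suggested in the comments above, the same pipeline might look like the sketch below. This is only an illustration: the object name PersistOrdering, the run helper, the local SparkSession and the three sample rows are all made up for the example, and dcclean is unpersisted at the end only because the sketch has nothing further to do with it.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object PersistOrdering {
  // Hypothetical helper: returns (raw count, non-null C1 count, clean count).
  def run(dataframe: DataFrame): (Long, Long, Long) = {
    val dcRawAll = dataframe.select("C1", "C2", "C3", "C4").persist()
    val statsdcRawAll = dcRawAll.count() // action: fills the dcRawAll cache

    val dc = dcRawAll.where(col("C1").isNotNull).persist()
    val statsdc = dc.count()             // action: reads dcRawAll's cache, fills dc's cache
    dcRawAll.unpersist(false)            // dc is materialized, so the parent can go

    val dcclean = dc.where(col("C2") === "SomeValue").persist()
    val statsdcclean = dcclean.count()   // action: reads dc's cache, fills dcclean's cache
    dc.unpersist()                       // dcclean is materialized, so dc can go

    // Any further actions on dcclean would go here, before releasing it.
    dcclean.unpersist()
    (statsdcRawAll, statsdc, statsdcclean)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session and tiny sample data, just to run the sketch.
    val spark = SparkSession.builder().master("local[*]").appName("persist-ordering").getOrCreate()
    import spark.implicits._
    val sample = Seq[(Option[String], String, Int, Int)](
      (Some("a"), "SomeValue", 1, 1),
      (None, "SomeValue", 2, 2),
      (Some("b"), "Other", 3, 3)
    ).toDF("C1", "C2", "C3", "C4")
    println(run(sample)) // prints (3,2,1)
    spark.stop()
  }
}
```

The only change from the annotated code is that each unpersist() now comes after an action on the DataFrame derived from it.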

Basically, make sure not to call .unpersist() on a DataFrame until every DataFrame that depends on it has had an action performed. Read this answer (and the linked documents) to better understand the difference between a transformation and an action.
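To make the transformation/action distinction concrete, here is a small sketch that counts how often a column expression is actually evaluated. Everything in it is an assumption made for illustration: the object name LazyCacheDemo, the run helper, the accumulator named "evaluations" and the five-row input from spark.range. Accumulators updated inside transformations are not exact under task retries, so treat the counts as a rough signal from a simple local run.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

object LazyCacheDemo {
  // Hypothetical helper: returns the evaluation count observed at three points.
  def run(spark: SparkSession): (Long, Long, Long) = {
    // Counts how many times the UDF body runs (illustrative only).
    val evaluations = spark.sparkContext.longAccumulator("evaluations")
    val tracked = udf((x: Long) => { evaluations.add(1); x * 2 })

    // Transformations plus persist(): nothing is computed or cached yet.
    val doubled = spark.range(5).withColumn("doubled", tracked(col("id"))).persist()
    val afterPersist = evaluations.value.longValue

    // First action: the rows are computed and the cache is filled.
    doubled.collect()
    val afterFirstAction = evaluations.value.longValue

    // Second action: expected to be served from the cache without re-running the UDF.
    doubled.collect()
    val afterSecondAction = evaluations.value.longValue

    doubled.unpersist()
    (afterPersist, afterFirstAction, afterSecondAction)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("lazy-cache-demo").getOrCreate()
    println(run(spark))
    spark.stop()
  }
}
```

If the first number is zero and the last two are equal, the work happened on the first action and the second action reused the cache. That is the behaviour the unpersist ordering has to protect.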
