
Drop spark dataframe from cache

I am using Spark 1.3.0 with the Python API. While transforming huge dataframes, I cache many DFs for faster execution:

df1.cache()
df2.cache()

Once a certain dataframe is no longer needed, how can I drop it from memory (or un-cache it)?

For example, df1 is used throughout the code, while df2 is only needed for a few transformations and is never used after that. I want to forcefully drop df2 to release more memory.

Just do the following:

df1.unpersist()
df2.unpersist()

Spark automatically monitors cache usage on each node and drops out old data partitions in a least-recently-used (LRU) fashion. If you would like to manually remove an RDD instead of waiting for it to fall out of the cache, use the RDD.unpersist() method.
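
For the scenario described in the question, a minimal sketch of the pattern (the column name and the transformation are illustrative assumptions, not from the original post):

df1.cache()          # df1 is reused throughout the job, so keep it cached
df2.cache()          # df2 is only needed for a few transformations

rows = df2.filter(df2["some_column"] > 0).count()   # hypothetical work that uses df2

df2.unpersist()      # df2 is done: release its cached blocks right away
                     # instead of waiting for LRU eviction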

If the dataframe is registered as a table for SQL operations, like

df.createGlobalTempView(tableName)  # or some other way, depending on the Spark version

then the cache can be dropped with the commands below; a short sketch follows each list. Of course, Spark also does this automatically.

Spark >= 2.x

Here, spark is a SparkSession object.

  • Drop a specific table/df from cache

     spark.catalog.uncacheTable(tableName)
  • Drop all tables/dfs from cache

     spark.catalog.clearCache()
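
An end-to-end sketch for Spark >= 2.x; it uses createOrReplaceTempView and the table name "my_table" as illustrative choices (with createGlobalTempView the name would be qualified as global_temp.my_table):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()
df = spark.range(1000)                              # any DataFrame works here

df.createOrReplaceTempView("my_table")              # register the DF for SQL access
spark.catalog.cacheTable("my_table")                # mark the table as cached
spark.sql("SELECT COUNT(*) FROM my_table").show()   # an action populates the cache

spark.catalog.uncacheTable("my_table")              # drop just this table from the cache
spark.catalog.clearCache()                          # or drop everything that is cached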

Spark <= 1.6.x

  • Drop a specific table/df from cache

     sqlContext.uncacheTable(tableName)
  • Drop all tables/dfs from cache

     sqlContext.clearCache()
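
Since the question mentions Spark 1.3.0, here is a comparable sketch for the 1.x API; the sample data and the table name are illustrative assumptions:

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="cache-demo-1x")
sqlContext = SQLContext(sc)

df = sqlContext.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.registerTempTable("my_table")                    # temp-table registration in Spark 1.x
sqlContext.cacheTable("my_table")                   # cache the table
sqlContext.sql("SELECT COUNT(*) FROM my_table").show()

sqlContext.uncacheTable("my_table")                 # drop this table from the cache
sqlContext.clearCache()                             # or drop all cached tables
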
  1. If you need the removal to block until the data is actually dropped => df2.unpersist(True)
  2. Non-blocking removal => df2.unpersist()
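
Note that in the Python API the flag is True, not true. A minimal illustration, assuming df2 is currently cached:

df2.unpersist(True)   # blocking: returns only after the cached blocks are removed
# ... later, after caching df2 again ...
df2.unpersist()       # non-blocking: returns immediately, blocks are freed in the background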

Here is a simple utility context manager that takes care of that for you:

import contextlib

@contextlib.contextmanager
def cached(df):
    # Cache the dataframe for the duration of the with-block and
    # always unpersist it when the block exits, even on an exception.
    df_cached = df.cache()
    try:
        yield df_cached
    finally:
        df_cached.unpersist()
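
A possible usage, assuming df is an existing DataFrame; the cache is released automatically when the block exits, even if an exception is raised:

with cached(df) as df_c:
    first = df_c.count()    # the first action materializes the cache
    second = df_c.count()   # later actions reuse the cached data
# at this point df_c has been unpersisted by the context manager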
