Remove a Spark DataFrame from the cache

Question

I am using Spark 1.3.0 with the Python API. While transforming huge DataFrames, I cache many DFs to speed up execution:

df1.cache()
df2.cache()

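Once a DataFrame is no longer needed, how can I remove it from memory (or un-cache it)?

For example, df1 is used throughout the code, while df2 is only needed for a few transformations and never again after that. I want to forcibly drop df2 to free up more memory.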

Answer 1

Just do the following:

df1.unpersist()
df2.unpersist()

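Spark automatically monitors cache usage on each node and drops old data partitions in least-recently-used (LRU) fashion. If you would like to remove an RDD manually instead of waiting for it to fall out of the cache, use the RDD.unpersist() method.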
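To make the difference between automatic LRU eviction and a manual unpersist() concrete, here is a minimal plain-Python sketch. It is a toy model only, not Spark's implementation: the ToyLRUCache class, its capacity of two entries, and the string values standing in for cached partitions are all assumptions made for illustration.

```python
from collections import OrderedDict


class ToyLRUCache:
    """Toy model of a bounded cache with LRU eviction (not Spark's code)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._entries = OrderedDict()  # least recently used entry first

    def put(self, key, value):
        # Insert or refresh an entry, then evict least recently used
        # entries until the cache fits its capacity again.
        self._entries[key] = value
        self._entries.move_to_end(key)
        while len(self._entries) > self.capacity:
            self._entries.popitem(last=False)

    def get(self, key):
        # A read marks the entry as most recently used.
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)
        return self._entries[key]

    def unpersist(self, key):
        # Manual removal: free the slot now instead of waiting for eviction.
        self._entries.pop(key, None)

    def keys(self):
        return list(self._entries)


cache = ToyLRUCache(capacity=2)
cache.put("df1", "partitions of df1")
cache.put("df2", "partitions of df2")
cache.get("df1")                        # df1 is now the most recently used
cache.put("df3", "partitions of df3")   # over capacity, so df2 is evicted
keys_after_eviction = cache.keys()

cache.unpersist("df1")                  # explicit removal, no waiting
keys_after_unpersist = cache.keys()
```

In this toy model, df2 is dropped automatically because it was used least recently, while df1 disappears only because it was removed explicitly. That mirrors the choice described above: wait for eviction, or call unpersist() yourself when you know the data is no longer needed.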

Answer 2

If the DataFrame is registered as a table for SQL operations, for example

df.createGlobalTempView(tableName) // or some other way, depending on the Spark version

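then the cache can be dropped with the following commands; of course, Spark also does this automatically.

Spark >= 2.x

Here spark is a SparkSession object.

Remove a specific table/DF from the cache: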
 spark.catalog.uncacheTable(tableName)
Remove all tables/DFs from the cache:
 spark.catalog.clearCache()

Spark 1.x

Here sqlContext is a SQLContext object.

Remove a specific table/DF from the cache:
 sqlContext.uncacheTable(tableName)
Remove all tables/DFs from the cache:
 sqlContext.clearCache()

Answer 3

Answer 4

Here is a simple utility context manager that takes care of this for you:

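import contextlib

@contextlib.contextmanager
def cached(df):
    # Cache on entry; the with-block receives the cached frame.
    df_cached = df.cache()
    try:
        yield df_cached
    finally:
        # Always release the cache, even if the block raises.
        df_cached.unpersist()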
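A short usage sketch may help show what the context manager guarantees. So that it runs without a Spark cluster, it uses a hypothetical FakeDataFrame stand-in that only tracks an is_cached flag; with a real DataFrame you would pass df2 itself. The cached() helper is repeated here only to keep the snippet self-contained.

```python
import contextlib


class FakeDataFrame:
    """Hypothetical stand-in that only records cache()/unpersist() calls."""

    def __init__(self):
        self.is_cached = False

    def cache(self):
        self.is_cached = True
        return self  # cache() hands back the frame so calls can be chained

    def unpersist(self):
        self.is_cached = False
        return self


@contextlib.contextmanager
def cached(df):
    df_cached = df.cache()
    try:
        yield df_cached
    finally:
        df_cached.unpersist()


df2 = FakeDataFrame()

# Normal path: cached inside the block, released on exit.
with cached(df2) as df2_cached:
    cached_inside_block = df2_cached.is_cached

cached_after_block = df2.is_cached

# Error path: the finally clause still releases the cache.
try:
    with cached(df2):
        raise ValueError("transformation failed")
except ValueError:
    pass

cached_after_error = df2.is_cached
```

The benefit of this pattern is that the lifetime of the cache is tied to the indentation of the with-block, so a short-lived frame such as df2 cannot be left cached by accident when a transformation fails midway.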