Remove a Spark DataFrame from the cache

Question

I am using Spark 1.3.0 with the Python API. While transforming huge DataFrames, I cache many DFs to speed up execution:

df1.cache()
df2.cache()

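Once a DataFrame is no longer needed, how can I drop it from memory (or un-cache it)?

For example, df1 is used throughout the code, while df2 is used for a few transformations and is never needed after that. I want to forcibly drop df2 to free up more memory.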

Answer 1

Just do the following:

df1.unpersist()
df2.unpersist()

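Spark automatically monitors cache usage on each node and drops old data partitions in a least-recently-used (LRU) fashion. If you would like to remove an RDD manually instead of waiting for it to fall out of the cache, use the RDD.unpersist() method.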
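To illustrate the manual route for the case in the question, here is a minimal sketch of a helper that releases several DataFrames in one call. The name release_all is hypothetical and not part of PySpark; the sketch only assumes that each item exposes DataFrame.unpersist(blocking), as PySpark DataFrames do.

```python
def release_all(dfs, blocking=True):
    """Unpersist every DataFrame in dfs and return how many were released.

    release_all is a hypothetical helper, not a PySpark API.
    blocking=True asks Spark to wait until the cached blocks are deleted.
    """
    released = 0
    for df in dfs:
        df.unpersist(blocking)  # drop this DataFrame's cached blocks
        released += 1
    return released
```

With the DataFrames from the question, calling release_all([df2]) after the last transformation that needs df2 would free its blocks while df1 stays cached.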

Answer 2

If the DataFrame is registered as a table for SQL operations, e.g.

df.createGlobalTempView(tableName)  # or another registration method, depending on the Spark version

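then the cache can be dropped with the following commands. Of course, Spark also does this automatically.

Spark >= 2.x

Here spark is a SparkSession object.

Drop a specific table/df from the cache
 spark.catalog.uncacheTable(tableName)
Drop all tables/dfs from the cache
 spark.catalog.clearCache()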

Spark 1.x

Here sqlContext is a SQLContext object.

Drop a specific table/df from the cache
 sqlContext.uncacheTable(tableName)
Drop all tables/dfs from the cache
 sqlContext.clearCache()
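For the Spark >= 2.x catalog calls, a small guard can make the uncache step safe to call repeatedly. This is only a sketch: uncache_if_cached is a hypothetical name, and it assumes spark is a SparkSession whose catalog provides isCached and uncacheTable.

```python
def uncache_if_cached(spark, table_name):
    """Uncache table_name only if the catalog reports it as cached.

    uncache_if_cached is a hypothetical helper, not a Spark API.
    Returns True if the table was uncached, False if it was not cached.
    """
    if spark.catalog.isCached(table_name):
        spark.catalog.uncacheTable(table_name)
        return True
    return False
```

The return value tells the caller whether anything was actually released, which can be handy when logging memory clean-up steps.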

Answer 3

Answer 4

Here is a simple utility context manager that takes care of this for you:

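import contextlib

@contextlib.contextmanager
def cached(df):
    # Cache the DataFrame for the duration of the with-block.
    df_cached = df.cache()
    try:
        yield df_cached
    finally:
        # Always release the cache, even if the block raises.
        df_cached.unpersist()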
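The same idea might be extended to several DataFrames at once. The variant below is a sketch under that assumption; cached_many is a hypothetical name, and it relies only on each argument providing cache() and unpersist().

```python
import contextlib

@contextlib.contextmanager
def cached_many(*dfs):
    """Cache several DataFrames for the duration of a with-block.

    cached_many is a hypothetical helper, not a PySpark API.
    """
    cached_dfs = []
    try:
        for df in dfs:
            cached_dfs.append(df.cache())
        yield cached_dfs
    finally:
        # Release whatever was cached, even if the block raised.
        for df in cached_dfs:
            df.unpersist()
```

It would be used as `with cached_many(df1, df2) as (c1, c2): ...`, after which both DataFrames are unpersisted regardless of how the block exits.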