What happens if I cache the same RDD twice in Spark?

Question

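I am building a generic function that receives an RDD and runs some calculations on it. Since I run more than one calculation on the input RDD, I would like to cache it. For example: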

public JavaRDD<String> foo(JavaRDD<String> r) {
    r.cache();
    JavaRDD<String> t1 = r... // Some calculations
    JavaRDD<String> t2 = r... // Other calculations
    return t1.union(t2);
}

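My question is: since r is handed to me, it may or may not already be cached. If it is cached and I call cache on it again, will Spark create a new layer of cache, meaning that while t1 and t2 are calculated I will have two instances of r in the cache? Or is Spark aware that r is already cached and ignores the call?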

Answer 1

Nothing. If you call cache on an RDD that is already cached, nothing happens; the RDD is cached once. Like transformations, caching is lazy:

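When you call cache, the RDD's storageLevel is set to MEMORY_ONLY.
When you call cache again, it is set to the same value (no change).
At evaluation time, when the underlying RDD is materialized, Spark checks the RDD's storageLevel and caches the RDD if that level requires it.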
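
The three steps above can be sketched with a toy model in plain Python. This is only an illustration of the described behavior, not Spark's implementation: `ToyRDD`, `storage_level`, `collect`, and `compute_calls` are hypothetical names made up for this sketch.

```python
MEMORY_ONLY = "MEMORY_ONLY"
NONE = "NONE"


class ToyRDD:
    """Toy stand-in for an RDD (hypothetical, not Spark's actual code)."""

    def __init__(self, compute):
        self._compute = compute      # function that produces the data
        self.storage_level = NONE    # no caching requested yet
        self._cached = None          # materialized data, if any
        self.compute_calls = 0       # how many times the data was computed

    def cache(self):
        # Only records the requested storage level; nothing is computed here.
        # A second call writes the same value, so it changes nothing.
        self.storage_level = MEMORY_ONLY
        return self

    def collect(self):
        # An action: evaluation, and therefore caching, happens here.
        if self._cached is not None:
            return self._cached
        self.compute_calls += 1
        data = self._compute()
        if self.storage_level != NONE:
            self._cached = data      # stored once, reused afterwards
        return data


rdd = ToyRDD(lambda: [x * x for x in range(5)])
rdd.cache()
rdd.cache()                          # same storage level, no second copy
assert rdd.compute_calls == 0        # still lazy: nothing computed yet
first = rdd.collect()
second = rdd.collect()
print(rdd.storage_level, rdd.compute_calls, first is second)
# prints: MEMORY_ONLY 1 True
```

In this sketch the second `cache()` call is a no-op because it only rewrites a flag, and the data is computed and stored once, on the first action.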

So you are safe.

Answer 2

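I just tested this on my cluster. Zohar is right: nothing happens, and the RDD is cached only once. I think the reason is that every RDD has an id internally, and Spark uses that id to mark whether the RDD has been cached. So caching the same RDD multiple times has no effect.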

Below are my code and screenshots:

Update [code added as requested]

# Cache and count, then check the storage info in the web UI.
# Assumes sc is an existing SparkContext, as in the pyspark shell.

raw_file = sc.wholeTextFiles('hdfs://10.21.208.21:8020/user/mercury/names', minPartitions=40)\
                 .setName("raw_file")\
                 .cache()
raw_file.count()

# Cache and count again, then look at the web UI: nothing changes.

raw_file.cache()
raw_file.count()

# Rename the RDD, then cache and count again to see whether it is cached a
# second time under the new name. Still nothing changes, so I think Spark may
# be using the RDD id as the marker. To know more we would need to read the
# documentation or even the source code.

raw_file.setName("raw_file_2")
raw_file.cache().count()
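
The same check can also be made from code instead of the web UI. A minimal sketch, assuming PySpark's `RDD.id()`, `RDD.name()`, `RDD.is_cached`, and `RDD.getStorageLevel()`; the helper name `cache_state` is hypothetical and not part of the original test.

```python
def cache_state(rdd):
    """Return (id, name, is_cached, storage level) for a PySpark RDD.

    Relies only on attributes of the RDD passed in, so it needs no imports.
    """
    return (rdd.id(), rdd.name(), rdd.is_cached, str(rdd.getStorageLevel()))


# Intended usage in a pyspark shell (not executed here):
#   before = cache_state(raw_file)
#   raw_file.setName("raw_file_2").cache()
#   after = cache_state(raw_file)
# Comparing the two tuples shows which fields the rename and the extra
# cache() call actually changed.
```

If the guess about the id is right, only the name field should differ between the two tuples.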

Answer 1
13, accepted, 2016-03-24 07:51:53

Answer 2
2, 2016-03-24 08:22:53
