Timing a foreach operation on a JavaRDD in Apache Spark

Question

Question: Is this a valid way to measure how long it takes to build an RDD?

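I am doing two things here. The basic approach is that we have M instances of what we call a DropEvaluation, and N DropResults. We need to compare each of the N DropResults to each of the M DropEvaluations. Every M must see every N, so that we end up with M results.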

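If I don't call .count() after building the RDD, the driver moves straight on to the next line of code and reports that it took almost no time to build an RDD that actually takes 30 minutes to build.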

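I just want to make sure I'm not missing something. For example, could the .count() itself be what takes a long time? And to time the .count() on its own, would I have to modify Spark's source code?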

M = 1000 or 2000. N = 10^7.

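This is effectively a cartesian problem. We chose accumulators because we need to write to each M in place. Building the full cartesian RDD would also be ugly.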

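We build a list of M accumulators (it doesn't seem possible to do a list accumulator in Java, right?). Then we loop through each of the N values in the RDD using foreach.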

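To clarify the question: the total time is computed correctly. What I'm asking is whether calling .count() on the RDD forces Spark to wait until the RDD is complete before it can count it. Is the time spent in .count() itself significant?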

Here is our code:

// assume standin exists and does its thing correctly

// this controls the final size of RDD, as we are not parallelizing something with an existing length
List<Integer> rangeN = IntStream.rangeClosed(simsLeft - blockSize + 1, simsLeft).boxed().collect(Collectors.toList());

// setup bogus array of size N for parallelize dataSetN to lead to dropResultsN       
JavaRDD<Integer> dataSetN = context.parallelize(rangeN);

// setup timing to create N
long NCreationStartTime = System.nanoTime();

// this maps each integer element of RDD dataSetN to a "geneDropped" chromosome simulation, we need N of these:
JavaRDD<TholdDropResult> dropResultsN = dataSetN.map(s -> standin.call(s)).persist(StorageLevel.MEMORY_ONLY());

// **** this line makes the driver wait until the RDD is done, right?
long dummyLength = dropResultsN.count();


long NCreationNanoSeconds = System.nanoTime() - NCreationStartTime;
double NCreationSeconds = (double)NCreationNanoSeconds / 1000000000.0;
double NCreationMinutes = NCreationSeconds / 60.0;

logger.error("{} test sims remaining", simsLeft);

// now get the time for just the dropComparison (part of accumulable's add)
long startDropCompareTime = System.nanoTime();

// here we iterate through each accumulator in the list and compare all N elements of dropResultsN RDD to each M in turn, our .add() is a custom AccumulableParam
for (Accumulable<TholdDropTuple, TholdDropResult> dropEvalAccum : accumList) {
    dropResultsN.foreach(new VoidFunction<TholdDropResult>() {
        @Override
        public void call(TholdDropResult dropResultFromN) throws Exception {
            dropEvalAccum.add(dropResultFromN);
        }
    });
}

// all the dropComparisons for all N to all M for this blocksize are done, check the time...
long dropCompareNanoSeconds = System.nanoTime() - startDropCompareTime;
double dropCompareSeconds = (double)dropCompareNanoSeconds / 1000000000.0;
double dropCompareMinutes = dropCompareSeconds / 60.0;

// write lines to indicate timing section
// log and write to file the time for the N-creation

    ...

} // end for that goes through dropAccumList

Answer 1

Spark programs are lazy: nothing actually runs until you call an action, such as count, on an RDD. You can find a list of common actions in Spark's documentation.
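The same build-versus-run distinction can be seen without a cluster. The sketch below is a plain-Java model, not Spark code: it uses java.util.stream, whose map is also lazy, and a hypothetical expensive function standing in for the per-element work. Building the pipeline never calls expensive; only the terminal collect (playing the role of a Spark action here) runs it, so a timer wrapped around the build step alone measures almost nothing.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import java.util.stream.Stream;

final class LazyTimingSketch {
    // Counts how many times the stand-in "expensive" function actually ran.
    static final AtomicInteger calls = new AtomicInteger();

    // Hypothetical stand-in for an expensive per-element computation.
    static int expensive(int x) {
        calls.incrementAndGet();
        return x * 2;
    }

    // Only describes the work; no element is processed here.
    static Stream<Integer> buildPipeline(int n) {
        return IntStream.rangeClosed(1, n).boxed().map(LazyTimingSketch::expensive);
    }

    public static void main(String[] args) {
        long buildStart = System.nanoTime();
        Stream<Integer> pipeline = buildPipeline(1000);
        long buildNanos = System.nanoTime() - buildStart;
        // Nothing has been computed yet, so calls is still 0 at this point.
        System.out.println("calls after build: " + calls.get() + ", nanos: " + buildNanos);

        long runStart = System.nanoTime();
        // The terminal operation is what forces every element through expensive().
        List<Integer> results = pipeline.collect(Collectors.toList());
        long runNanos = System.nanoTime() - runStart;
        System.out.println("calls after collect: " + calls.get()
                + ", results: " + results.size() + ", nanos: " + runNanos);
    }
}
```

In this model the meaningful measurement is the one around the terminal operation, which mirrors timing from before the map up to the end of the action in the question's code.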

// **** this line makes the driver wait until the RDD is done, right?
long dummyLength = dropResultsN.count();

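Yes. In this case count forces dropResultsN to be computed, so it will take a long time. If you run count a second time, it will return quickly, because the RDD has already been computed and cached.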
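To illustrate why a second count returns quickly, here is a small plain-Java model of "compute once, then reuse", which is roughly the effect of persist followed by a first action. Memo, slowCount and timeNanos are hypothetical names made up for this sketch, not Spark APIs, and the sleep only simulates a slow computation.

```java
import java.util.function.Supplier;

final class PersistSketch {
    /** Computes its value on first access and reuses it afterwards. */
    static final class Memo<T> implements Supplier<T> {
        private final Supplier<T> compute;
        private T value;
        private boolean done;

        Memo(Supplier<T> compute) {
            this.compute = compute;
        }

        @Override
        public synchronized T get() {
            if (!done) {
                value = compute.get(); // the expensive part runs only once
                done = true;
            }
            return value;
        }
    }

    /** Returns the wall-clock nanoseconds spent in one call to action.get(). */
    static long timeNanos(Supplier<?> action) {
        long start = System.nanoTime();
        action.get();
        return System.nanoTime() - start;
    }

    // Hypothetical slow computation; the sleep stands in for a long build.
    static long slowCount() {
        try {
            Thread.sleep(200);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return 10_000_000L;
    }

    public static void main(String[] args) {
        Memo<Long> cachedCount = new Memo<>(PersistSketch::slowCount);
        long first = timeNanos(cachedCount);  // pays for the computation
        long second = timeNanos(cachedCount); // served from the stored value
        System.out.println("first: " + first + " ns, second: " + second + " ns");
    }
}
```

Under this model, the first timed call includes the full cost of producing the data, while the second mostly measures the lookup, which is the pattern the answer describes for a persisted RDD.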

Solution 1
1 2016-07-11 10:13:34