Strange Spark behavior with caching and actions

Question

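I have been trying to work out why I see strange behavior when running certain Spark jobs. The job errors out if I place an action (a .show(1) call) either right after caching the DataFrame or right before writing the DataFrame back to HDFS.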

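There is a very similar post on SO here:

Spark SQL SaveMode.Overwrite, getting java.io.FileNotFoundException and requiring 'REFRESH TABLE tableName'.

Basically, the other post explains that when you read from the same HDFS directory that you are writing to, and your SaveMode is "overwrite", you get a java.io.FileNotFoundException.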

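But here I am finding that merely moving where the action sits in the program gives vastly different results: the program either completes or throws this exception.

Can anyone explain why Spark is not being consistent here?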

val myDF = spark.read.format("csv")
    .option("header", "false")
    .option("delimiter", "\t")
    .schema(schema)
    .load(myPath)

// If I cache it here or persist it then do an action after the cache, it will occasionally 
// not throw the error. This is when completely restarting the SparkSession so there is no
// risk of another user interfering on the same JVM.

myDF.cache()
myDF.show(1)

// Just an example.
// Many different transformations are then applied...

val secondDF = mergeOtherDFsWithmyDF(myDF, otherDF, thirdDF)

val fourthDF = mergeTwoDFs(thirdDF, StringToCheck, fifthDF)

// Below is the same .show(1) action as above. This one ALWAYS results in a
// successful completion, whereas the myDF.show(1) above sometimes results in a
// FileNotFoundException and sometimes completes successfully. The only thing
// that changes between test runs is which of the two calls is executed: either
// fourthDF.show(1) or myDF.show(1) is commented out.

fourthDF.show(1)
fourthDF.write
    .mode(writeMode)
    .option("header", "false")
    .option("delimiter", "\t")
    .csv(myPath)

Answer 1

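Try using count instead of show(1). I believe the issue is that Spark is trying to be smart and does not load the whole DataFrame, since show does not need all of it. Running count forces Spark to load and properly cache all the data, which will hopefully make the inconsistency go away.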
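A minimal sketch of that suggestion, under stated assumptions: the local master, the app name, the helper name cacheFully, and the spark.range stand-in for the CSV input are all hypothetical and not part of the question. It only shows the cache-then-count pattern; it does not by itself make overwriting the input path safe, because cached blocks can still be lost and recomputed from the source files.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object CacheThenCount {
  /** Marks df for caching, then forces a full scan. Returns the row count. */
  def cacheFully(df: DataFrame): Long = {
    df.cache()  // lazy: only marks the plan for caching, nothing is read yet
    df.count()  // full scan, so every partition is computed and stored
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")            // assumption: local run for illustration
      .appName("cache-then-count")   // hypothetical name
      .getOrCreate()

    // Stand-in for the CSV input in the question (assumption): 8 partitions.
    val myDF = spark.range(0, 1000, 1, 8).toDF("id")

    val rows = cacheFully(myDF)
    println(s"cached $rows rows")

    spark.stop()
  }
}
```

In the question's code this would correspond to replacing myDF.show(1) with myDF.count() right after myDF.cache().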

Answer 2

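Spark only materializes RDDs on demand. Most actions, such as count(), have to read every partition of the DataFrame, but actions such as take() and first() do not need all the partitions.

In your case the action needs a single partition, so only one partition is materialized and cached. When you then run count(), all partitions have to be materialized, and they are cached to the extent that your available memory allows.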
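One way to observe this is to ask the driver how many partitions are cached after each action. The sketch below is illustrative only: the object name, the helper cachedPartitions, and the spark.range input are hypothetical, and it relies on SparkContext.getRDDStorageInfo, a developer API whose numbers are reported asynchronously and may lag slightly behind the actions.

```scala
import org.apache.spark.sql.SparkSession

object PartialCacheDemo {
  /** Total cached partitions over all persisted RDDs, as reported by the driver. */
  def cachedPartitions(spark: SparkSession): Int =
    spark.sparkContext.getRDDStorageInfo.map(_.numCachedPartitions).sum

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")             // assumption: local run for illustration
      .appName("partial-cache-demo")  // hypothetical name
      .getOrCreate()

    // Hypothetical input with 8 partitions, marked for caching (still lazy).
    val df = spark.range(0, 1000, 1, 8).toDF("id").cache()

    df.take(1)   // needs only enough partitions to produce one row
    println(s"after take(1): ${cachedPartitions(spark)} of 8 partitions cached")

    df.count()   // has to scan every partition
    println(s"after count(): ${cachedPartitions(spark)} of 8 partitions cached")

    spark.stop()
  }
}
```

If the explanation above holds, the first line should typically report fewer cached partitions than the second.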

Answer 1
Score: 2, accepted, 2017-11-22 07:25:20

Answer 2
Score: 0, 2017-11-22 07:29:57
