Apache Spark: Why does the reduceByKey transformation execute the DAG?

Question

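I am facing a strange problem. As far as I know, the DAG of operations in Spark is executed only when an action is performed. However, I can see that the reduceByKey() operation (which is a transformation) starts executing the DAG.

Steps to reproduce: try the code below.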

SparkConf conf = new SparkConf().setMaster("local").setAppName("Test");
JavaSparkContext context = new JavaSparkContext(conf);

JavaRDD<String> textFile = context.textFile("any non-existing path"); // This path should not exist

JavaRDD<String> flatMap = textFile.flatMap(x -> Arrays.asList(x.split(" ")).iterator());
JavaPairRDD<String, Integer> mapToPair = flatMap.mapToPair(x -> new Tuple2<String, Integer>((String) x, 1));

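Note: the file path must not be an existing path. In other words, the file should not exist.

If you run this code, nothing happens, as expected. However, if you add the following line to the program and run it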

mapToPair.reduceByKey((x, y) -> x + y);

it throws the following exception:

Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:

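This means it has already started executing the DAG. Since reduceByKey() is a transformation, this should not happen until an action such as collect() or take() is executed.

Spark version: 2.0.0. Please share your suggestions.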

Answer 1

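This is because it is not actually the DAG that gets executed (as in: its full materialization).

What happens is that reduceByKey needs a partitioner to work. If you do not provide one, Spark creates one based on conventions and defaults. The "default partitioner" is documented in the code with the following comment: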

/**
* Choose a partitioner to use for a cogroup-like operation between a number of RDDs.
*
* If any of the RDDs already has a partitioner, choose that one.
*
* Otherwise, we use a default HashPartitioner. For the number of partitions, if
* spark.default.parallelism is set, then we'll use the value from SparkContext
* defaultParallelism, otherwise we'll use the max number of upstream partitions.
*
* Unless spark.default.parallelism is set, the number of partitions will be the
* same as the number of partitions in the largest upstream RDD, as this should
* be least likely to cause out-of-memory errors.
*
* We use two method parameters (rdd, others) to enforce callers passing at least 1 RDD.
*/

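This definition implies that, in some cases, the number of partitions of all upstream RDDs has to be computed. In your case, this means asking the "file system" (could be Hadoop, could be local, ...) to do whatever is necessary. For example, a single call to a Hadoop file system can return multiple files, and each file can in turn be split according to various optimizations defined by its InputFormat, which can only be known by actually looking the files up.

That is what gets executed here, not the actual DAG (e.g. your map / flatMap / aggregate, ...).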
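To make the distinction concrete, here is a minimal toy model in plain Java. It is not Spark code: ToyRdd, LazyPartitionsDemo and their methods are hypothetical names invented for illustration, and the "file system lookup" is simulated by a Supplier that throws. It only sketches the idea that narrow transformations leave the input alone, while choosing a default partitioner has to ask the upstream for its partition count immediately.

```java
import java.util.function.Supplier;

public class LazyPartitionsDemo {

    /** Hypothetical stand-in for an RDD: it only knows how to count its partitions on demand. */
    static final class ToyRdd {
        private final Supplier<Integer> partitionCount;

        ToyRdd(Supplier<Integer> partitionCount) {
            this.partitionCount = partitionCount;
        }

        /** Narrow transformation (map, flatMap, ...): wraps the parent and touches nothing. */
        ToyRdd map() {
            return new ToyRdd(partitionCount);
        }

        /** No partitioner given: the default one needs the upstream partition count right now. */
        ToyRdd reduceByKey() {
            int upstream = partitionCount.get(); // the eager "file system" lookup
            return new ToyRdd(() -> upstream);
        }

        /** Partition count supplied by the caller: nothing upstream has to be inspected. */
        ToyRdd reduceByKey(int numPartitions) {
            return new ToyRdd(() -> numPartitions);
        }
    }

    /** Models textFile on a missing path: listing its partitions is the step that fails. */
    static ToyRdd textFile(String path) {
        return new ToyRdd(() -> {
            throw new IllegalStateException("Input path does not exist: " + path);
        });
    }

    /** Returns true if building the lineage with the default partitioner throws. */
    public static boolean defaultPartitionerFails() {
        ToyRdd pairs = textFile("any non-existing path").map().map();
        try {
            pairs.reduceByKey();
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    /** Returns true if building the lineage with an explicit partition count throws. */
    public static boolean explicitPartitionerFails() {
        ToyRdd pairs = textFile("any non-existing path").map().map();
        try {
            pairs.reduceByKey(4);
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("default partitioner fails early: " + defaultPartitionerFails());
        System.out.println("explicit partitioner fails early: " + explicitPartitionerFails());
    }
}
```

In this model the two map() calls never touch the missing path. Only the no-argument reduceByKey() does, which mirrors the partition lookup described above rather than any execution of the user functions.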

You can avoid this lookup by using the reduceByKey variant that takes your own partitioner:

 reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)]
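Applied to the code from the question, this could look like the sketch below. It assumes the Spark 2.x Java API, where JavaPairRDD offers reduceByKey(Partitioner, Function2). The class name ExplicitPartitionerExample, the helper wordCounts and the partition count 4 are arbitrary choices made for illustration, not something the question or answer prescribes.

```java
import java.util.Arrays;

import org.apache.spark.HashPartitioner;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class ExplicitPartitionerExample {

    /** Builds the word-count lineage with a caller-supplied partitioner. */
    public static JavaPairRDD<String, Integer> wordCounts(
            JavaSparkContext context, String path, int numPartitions) {
        JavaRDD<String> textFile = context.textFile(path);
        JavaRDD<String> words =
                textFile.flatMap(line -> Arrays.asList(line.split(" ")).iterator());
        JavaPairRDD<String, Integer> pairs =
                words.mapToPair(word -> new Tuple2<String, Integer>(word, 1));
        // Passing the partitioner explicitly means Spark does not need to derive
        // a default one from the upstream partition count.
        return pairs.reduceByKey(new HashPartitioner(numPartitions), (x, y) -> x + y);
    }

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setMaster("local").setAppName("Test");
        try (JavaSparkContext context = new JavaSparkContext(conf)) {
            // Hypothetical partition count; pick one that suits your data.
            JavaPairRDD<String, Integer> counts =
                    wordCounts(context, "any non-existing path", 4);
            System.out.println("Lineage built: " + (counts != null));
            // An action such as counts.collect() would still fail on a missing path,
            // because at that point the input really has to be read.
        }
    }
}
```

The trade-off is that you now have to choose the number of partitions yourself instead of letting Spark derive it from the input.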
