Different behavior when using Spark REPL and standalone Spark program
When I run this code in the Spark REPL:
val sc = new SparkContext("local[4]" , "")
val x = sc.parallelize(List( ("a" , "b" , 1) , ("a" , "b" , 1) , ("c" , "b" , 1) , ("a" , "d" , 1)))
val byKey = x.map({case (sessionId,uri,count) => (sessionId,uri)->count})
val reducedByKey = byKey.reduceByKey(_ + _ , 2)
val grouped = byKey.groupByKey
val count = grouped.map{case ((sessionId,uri),count) => ((sessionId),(uri,count.sum))}
val grouped2 = count.groupByKey
The REPL shows the type of grouped2 as:
grouped2: org.apache.spark.rdd.RDD[(String, Seq[(String, Int)])]
However, if I use the same code in a standalone Spark program, grouped2 gets a different type, as this error shows:
type mismatch;
found : org.apache.spark.rdd.RDD[(String, Iterable[(String, Int)])]
required: org.apache.spark.rdd.RDD[(String, Seq[(String, Int)])]
Note: (String, Iterable[(String, Int)]) >: (String, Seq[(String, Int)]), but class RDD is invariant in type T.
You may wish to define T as -T instead. (SLS 4.5)
val grouped2 : org.apache.spark.rdd.RDD[(String, Seq[(String, Int)])] = count.groupByKey
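The note about invariance in the error above can be reproduced without Spark. The following is a minimal sketch in plain Scala; Box is a hypothetical stand-in for RDD (it is not a Spark class) and is invariant in T in the same way, so a Box of Iterable groups cannot be assigned to a Box of Seq groups even though every Seq is an Iterable.

```scala
object InvarianceDemo {
  // Hypothetical stand-in for RDD[T]: invariant in T (no +T or -T annotation).
  final class Box[T](val items: List[T]) {
    def map[U](f: T => U): Box[U] = new Box(items.map(f))
  }

  // Plays the role of the value returned by groupByKey in Spark 1.0.
  val grouped: Box[(String, Iterable[Int])] =
    new Box(List("a" -> Iterable(1, 2), "c" -> Iterable(3)))

  // Does not compile, for the same reason as the error above:
  // Box[(String, Iterable[Int])] is not a Box[(String, Seq[Int])].
  // val direct: Box[(String, Seq[Int])] = grouped

  // Compiles: build a new Box whose element type really is (String, Seq[Int]).
  val asSeq: Box[(String, Seq[Int])] =
    grouped.map { case (key, values) => (key, values.toSeq) }

  def main(args: Array[String]): Unit =
    asSeq.items.foreach(println)
}
```

Uncommenting the direct assignment should produce a type mismatch of the same shape as the one reported for grouped2.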
Here is the complete code for the standalone program:
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._
import org.apache.spark.rdd._
object Tester extends App {
val sc = new SparkContext("local[4]" , "")
val x = sc.parallelize(List( ("a" , "b" , 1) , ("a" , "b" , 1) , ("c" , "b" , 1) , ("a" , "d" , 1)))
val byKey = x.map({case (sessionId,uri,count) => (sessionId,uri)->count})
val reducedByKey = byKey.reduceByKey(_ + _ , 2)
val grouped = byKey.groupByKey
val count = grouped.map{case ((sessionId,uri),count) => ((sessionId),(uri,count.sum))}
val grouped2 : org.apache.spark.rdd.RDD[(String, Seq[(String, Int)])] = count.groupByKey
}
Shouldn't the types returned in the REPL and in the standalone program be the same?
Update: in the standalone program, grouped2 is inferred as RDD[(String, Iterable[Nothing])], so val grouped2: RDD[(String, Iterable[Nothing])] = count.groupByKey compiles.
So there are three possible types returned, depending on how the program is run?
Update 2: IntelliJ seems to be inferring the type incorrectly:
val x : org.apache.spark.rdd.RDD[(String, (String, Int))] = sc.parallelize(List( ("a" , ("b" , 1)) , ("a" , ("b" , 1))))
val grouped = x.groupByKey()
IntelliJ infers grouped as org.apache.spark.rdd.RDD[(String, Iterable[Nothing])] when it should be org.apache.spark.rdd.RDD[(String, Iterable[(String, Int)])], which is what the Spark 1.0 REPL infers.
For completeness: the Spark API changed between 0.9 and 1.0, and groupByKey now returns pairs whose second member is an Iterable rather than a Seq.
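Given that API change, there are two straightforward ways to make the standalone program compile against Spark 1.0. The sketch below assumes Spark 1.0 or later is on the classpath; the object name GroupedTypes and the method names groupAsIterable and groupAsSeq are made up for illustration. It either declares the Iterable type that groupByKey now returns, or converts each group with mapValues(_.toSeq) when downstream code really needs a Seq.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._ // pair-RDD implicits, needed on early 1.x releases
import org.apache.spark.rdd.RDD

object GroupedTypes {
  // Option 1: declare the type that groupByKey returns since Spark 1.0.
  def groupAsIterable(counts: RDD[(String, (String, Int))]): RDD[(String, Iterable[(String, Int)])] =
    counts.groupByKey()

  // Option 2: keep the old Seq-based type by converting every group explicitly.
  def groupAsSeq(counts: RDD[(String, (String, Int))]): RDD[(String, Seq[(String, Int)])] =
    counts.groupByKey().mapValues(_.toSeq)

  def main(args: Array[String]): Unit = {
    // "grouped-types" is an arbitrary application name chosen for this sketch.
    val sc = new SparkContext("local[4]", "grouped-types")
    try {
      val counts = sc.parallelize(List(("a", ("b", 2)), ("c", ("b", 1)), ("a", ("d", 1))))
      groupAsSeq(counts).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

Option 1 avoids an extra pass over the data, so it is probably preferable unless an existing interface requires Seq.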
On the IntelliJ issue: unfortunately, it is not hard to confuse IntelliJ's type inference. If it comes up with Nothing, that is very likely wrong.