How to resolve error: value reduceByKey is not a member of org.apache.spark.rdd.RDD[(Int, Int)]?
I created this RDD:
scala> val data=sc.textFile("sparkdata.txt")
Then I tried to return the contents of the file:
scala> data.collect
I split the existing data into individual words using:
scala> val splitdata = data.flatMap(line => line.split(" "));
scala> splitdata.persist()
scala> splitdata.collect;
Now I am performing the map-reduce operation:
scala> val mapdata = splitdata.map(word => (word,1));
scala> mapdata.collect;
scala> val reducedata = mapdata.reduceByKey(_+_);
To get the result:
scala> reducedata.collect;
When I want to display the first 10 rows:
splitdata.groupByKey(identity).count().show(10)
I receive the following error:
<console>:38: error: value groupByKey is not a member of org.apache.spark.rdd.RDD[String]
splitdata.groupByKey(identity).count().show(10)
^
<console>:38: error: missing argument list for method identity in object Predef
Unapplied methods are only converted to functions when a function type is expected.
You can make this conversion explicit by writing `identity _` or `identity(_)` instead of `identity`.
splitdata.groupByKey(identity).count().show(10)
^
Like reduceByKey(), groupByKey() is a method of pair RDDs (RDD[(K, V)]), not of a general RDD. While reduceByKey() reduces an RDD[(K, V)] to another RDD[(K, V)] using the provided binary function, groupByKey() transforms an RDD[(K, V)] into an RDD[(K, Iterable[V])]. (Note also that groupByKey() takes no function argument, which is why passing identity produces the second error.) To further transform the Iterable[V] for each key, you would typically apply mapValues() (or flatMapValues()) with the desired function.
For example:
val rdd = sc.parallelize(Seq(
  "apple", "apple", "orange", "banana", "banana", "orange", "apple", "apple", "orange"
))
rdd.map((_, 1)).reduceByKey(_ + _).collect
// res1: Array[(String, Int)] = Array((apple,4), (banana,2), (orange,3))
rdd.map((_, 1)).groupByKey().mapValues(_.sum).take(2)
// res2: Array[(String, Int)] = Array((apple,4), (banana,2))
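flatMapValues is mentioned above but not demonstrated. As a minimal sketch with the same rdd, it flattens each grouped Iterable[V] back into individual pairs:

```scala
// groupByKey yields RDD[(String, Iterable[Int])]; flatMapValues(identity)
// expands each Iterable back into one (word, 1) pair per occurrence:
rdd.map((_, 1)).groupByKey().flatMapValues(identity).collect
// same pairs as rdd.map((_, 1)).collect, though the order may differ
```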
If you are only interested in the number of groups after applying groupByKey():
rdd.map((_, 1)).groupByKey().count()
// res3: Long = 3
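Returning to the question itself: splitdata is an RDD[String], so neither reduceByKey nor groupByKey is available until each word is mapped to a (key, value) pair. Also, show() belongs to Dataset/DataFrame, not RDD; on an RDD, take(10) is the closest analogue. A minimal sketch of what the failing line likely intended (the first 10 word counts), assuming the splitdata from the question:

```scala
// splitdata is an RDD[String] -- plain words, not key-value pairs.
splitdata
  .map(word => (word, 1))   // RDD[(String, Int)]
  .reduceByKey(_ + _)       // RDD[(String, Int)] -- word counts
  .take(10)                 // Array[(String, Int)], first 10 entries
  .foreach(println)
```

An equivalent via groupByKey would be splitdata.map((_, 1)).groupByKey().mapValues(_.size).take(10), though reduceByKey is generally preferred because it combines values on each partition before shuffling.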