
value reduceByKey is not a member of org.apache.spark.rdd.RDD[(Int, Int)] after import

I created this RDD:

scala> val data=sc.textFile("sparkdata.txt")

Then I try to return the contents of the file:

scala> data.collect

I split the existing data into individual words using:

scala> val splitdata = data.flatMap(line => line.split(" "));
scala> splitdata.persist()
scala> splitdata.collect;

Now I am doing the map-reduce operation:

scala> val mapdata = splitdata.map(word => (word,1));
scala> mapdata.collect;
scala> val reducedata = mapdata.reduceByKey(_+_);

To get the result:

scala> reducedata.collect;

When I want to display the first 10 rows:

splitdata.groupByKey(identity).count().show(10)

I get the following error:

<console>:38: error: value groupByKey is not a member of org.apache.spark.rdd.RDD[String]
       splitdata.groupByKey(identity).count().show(10)
                 ^
<console>:38: error: missing argument list for method identity in object Predef
Unapplied methods are only converted to functions when a function type is expected.
You can make this conversion explicit by writing `identity _` or `identity(_)` instead of `identity`.
       splitdata.groupByKey(identity).count().show(10)
                            ^

Like reduceByKey(), groupByKey() is a method of a pair RDD (an RDD[K, V]), not of a general RDD. While reduceByKey() reduces an RDD[K, V] to another RDD[K, V] using the provided binary function, groupByKey() transforms an RDD[K, V] into an RDD[(K, Iterable[V])]. To further transform the Iterable[V] per key, one would typically apply mapValues() (or flatMapValues()) with a provided function.

For example:

val rdd = sc.parallelize(Seq(
  "apple", "apple", "orange", "banana", "banana", "orange", "apple", "apple", "orange"
))

rdd.map((_, 1)).reduceByKey(_ + _).collect
// res1: Array[(String, Int)] = Array((apple,4), (banana,2), (orange,3))

rdd.map((_, 1)).groupByKey().mapValues(_.sum).take(2)
// res2: Array[(String, Int)] = Array((apple,4), (banana,2))

If you're only interested in the number of groups after applying groupByKey():

rdd.map((_, 1)).groupByKey().count()
// res3: Long = 3
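
As for displaying the first 10 results from the question: show() is a DataFrame/Dataset method, not an RDD method, so take() is the usual way to peek at an RDD. Below is a minimal sketch, assuming the splitdata and reducedata RDDs defined in the question; sortBy and keyBy are standard RDD methods used here for illustration:

// Print the first 10 (word, count) pairs from the reduced RDD.
reducedata.take(10).foreach(println)

// To see the 10 most frequent words, sort by count (descending) first.
reducedata.sortBy(_._2, ascending = false).take(10).foreach(println)

// If the goal was to group the raw words directly, keyBy(identity) turns
// the RDD[String] into an RDD[(String, String)], so groupByKey() applies.
splitdata.keyBy(identity).groupByKey().mapValues(_.size).take(10).foreach(println)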
