How to resolve error: value reduceByKey is not a member of org.apache.spark.rdd.RDD[(Int, Int)]?
I created this RDD:
scala> val data=sc.textFile("sparkdata.txt")
Then I tried to return the contents of the file:
scala> data.collect
I split the existing data into individual words using:
scala> val splitdata = data.flatMap(line => line.split(" "));
scala> splitdata.persist()
scala> splitdata.collect;
Now I am performing the map-reduce operation:
scala> val mapdata = splitdata.map(word => (word,1));
scala> mapdata.collect;
scala> val reducedata = mapdata.reduceByKey(_+_);
To get the result:
scala> reducedata.collect;
When I want to display the first 10 rows:
splitdata.groupByKey(identity).count().show(10)
I receive the following error:
<console>:38: error: value groupByKey is not a member of org.apache.spark.rdd.RDD[String]
splitdata.groupByKey(identity).count().show(10)
^
<console>:38: error: missing argument list for method identity in object Predef
Unapplied methods are only converted to functions when a function type is expected.
You can make this conversion explicit by writing `identity _` or `identity(_)` instead of `identity`.
splitdata.groupByKey(identity).count().show(10)
^
Like reduceByKey(), groupByKey() is a method of pair RDDs (RDD[(K, V)]), not of a general RDD. While reduceByKey() reduces an RDD[(K, V)] to another RDD[(K, V)] using the provided binary function, groupByKey() transforms an RDD[(K, V)] into an RDD[(K, Iterable[V])]. (Note also that groupByKey() takes no function argument, which is why passing identity produces the second error.) To further transform the Iterable[V] for each key, you would typically apply mapValues() (or flatMapValues()) with the desired function.
For example:
val rdd = sc.parallelize(Seq(
  "apple", "apple", "orange", "banana", "banana", "orange", "apple", "apple", "orange"
))
rdd.map((_, 1)).reduceByKey(_ + _).collect
// res1: Array[(String, Int)] = Array((apple,4), (banana,2), (orange,3))
rdd.map((_, 1)).groupByKey().mapValues(_.sum).take(2)
// res2: Array[(String, Int)] = Array((apple,4), (banana,2))
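flatMapValues is mentioned above but not demonstrated. As a minimal sketch with the same rdd, it flattens each grouped Iterable[V] back into individual pairs:

```scala
// groupByKey yields RDD[(String, Iterable[Int])]; flatMapValues(identity)
// expands each Iterable back into one (word, 1) pair per occurrence:
rdd.map((_, 1)).groupByKey().flatMapValues(identity).collect
// same pairs as rdd.map((_, 1)).collect, though the order may differ
```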
If you are only interested in the number of groups after applying groupByKey():
rdd.map((_, 1)).groupByKey().count()
// res3: Long = 3
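Returning to the question itself: splitdata is an RDD[String], so neither reduceByKey nor groupByKey is available until each word is mapped to a (key, value) pair. Also, show() belongs to Dataset/DataFrame, not RDD; on an RDD, take(10) is the closest analogue. A minimal sketch of what the failing line likely intended (the first 10 word counts), assuming the splitdata from the question:

```scala
// splitdata is an RDD[String] -- plain words, not key-value pairs.
splitdata
  .map(word => (word, 1))   // RDD[(String, Int)]
  .reduceByKey(_ + _)       // RDD[(String, Int)] -- word counts
  .take(10)                 // Array[(String, Int)], first 10 entries
  .foreach(println)
```

An equivalent via groupByKey would be splitdata.map((_, 1)).groupByKey().mapValues(_.size).take(10), though reduceByKey is generally preferred because it combines values on each partition before shuffling.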