
aggregateByKey method not working in Spark RDD

Below is my sample data:


I created a pair RDD to perform combineByKey and aggregateByKey operations. Below is my code:

val rd=sc.textFile("file:///home/cloudera/Desktop/details.txt").map(line=>line.split(",")).map(p=>((p(0).toString,p(1).toString),(p(3).toLong,p(2).toString.toInt)))  

Above, I paired the first two columns as the key and the remaining columns as the value. Now I want only the distinct values of the 3rd column in the dataset (the right element of the value tuple), which I was able to get with combineByKey. Below is my code:

val reduced = rd.combineByKey(
scala> reduced.foreach(println)
((1,Siddhesh),(36300,Set(43, 12)))
((2,Devil),(6000,Set(10, 11)))

Now I map it so that, for each key, I get the sum of the values and the number of distinct entries.

scala> val newRdd=reduced.map(p=>(p._1._1,p._1._2,p._2._1,p._2._2.size))

scala> newRdd.foreach(println)

Here, for Devil, the last value is 2: the dataset has the value 10 twice for the 'Devil' record, and since I used a Set, the duplicates are eliminated. So now I tried it with aggregateByKey. Below is my code with the error:

val rd=sc.textFile("file:///home/cloudera/Desktop/details.txt").map(line=>line.split(",")).map(p=>((p(0).toString,p(1).toString),(p(3).toString.toInt,p(2).toString.toInt)))    

I converted the value column from Long to Int because initializing with '0' was throwing an error.

scala> val reducedByAggKey = rd.aggregateByKey((0,0))(
     |        (x:(Int,Set[Int]),y:(Int,Int))=>(x._1+y._1,x._2+y._2),
     |       (x:(Int,Set[Int]),y:(Int,Set[Int]))=>{(x._1+y._1,x._2++y._2)}
     | )
<console>:36: error: type mismatch;
 found   : scala.collection.immutable.Set[Int]
 required: Int
<console>:37: error: type mismatch;
 found   : scala.collection.immutable.Set[Int]
 required: Int

And as suggested by Leo, below is my code with the error:

    scala> val reduced = rdd.aggregateByKey((0, Set.empty[Int]))(
     |   (x: (Int, Set[Int]), y: (Int, Int)) => (x._1 + y._1, y._2+x._2),
     |   (x: (Int, Set[Int]), y: (Int, Set[Int])) => (x._1 + y._1, y._2++ x._2)
     | )
<console>:36: error: overloaded method value + with alternatives:
  (x: Double)Double <and>
  (x: Float)Float <and>
  (x: Long)Long <and>
  (x: Int)Int <and>
  (x: Char)Int <and>
  (x: Short)Int <and>
  (x: Byte)Int <and>
  (x: String)String
 cannot be applied to (Set[Int])
         (x: (Int, Set[Int]), y: (Int, Int)) => (x._1 + y._1, y._2+x._2),

So where am I going wrong here? Please correct me.

If I understand your requirement correctly, to get the full count rather than the distinct count, use List instead of Set for the aggregations. As to the problem with your aggregateByKey, it's due to the incorrect type of the zeroValue, which should be (0, List.empty[Int]) (it would have been (0, Set.empty[Int]) if you were to stick to using Set):

val reduced = rdd.aggregateByKey((0, List.empty[Int]))(
  (x: (Int, List[Int]), y: (Int, Int)) => (x._1 + y._1, y._2 :: x._2),
  (x: (Int, List[Int]), y: (Int, List[Int])) => (x._1 + y._1, y._2 ::: x._2)
)

reduced.collect
// res1: Array[((String, String), (Int, List[Int]))] =
//   Array(((2,Devil),(6000,List(11, 10, 10))), ((1,Siddhesh),(36300,List(12, 43))))

val newRdd = reduced.map(p => (p._1._1, p._1._2, p._2._1, p._2._2.size))

// res2: Array[(String, String, Int, Int)] =
//   Array((2,Devil,6000,3), (1,Siddhesh,36300,2))
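
To see why the zeroValue type matters, here is a minimal Spark-free sketch that emulates what aggregateByKey does, assuming two partitions. The helper aggregateByKeyLocal and the sample rows are hypothetical (the rows are made up to be consistent with the totals above), and this is not Spark's actual implementation. The point is that the accumulator type U is taken from zeroValue, so both functions must produce that same type:

```scala
object AggregateByKeySketch {
  // Hypothetical local stand-in for aggregateByKey (not Spark code).
  // seqOp folds the values of one partition into an accumulator of type U,
  // starting from zeroValue; combOp merges accumulators across partitions.
  def aggregateByKeyLocal[K, V, U](partitions: Seq[Seq[(K, V)]], zeroValue: U)(
      seqOp: (U, V) => U,
      combOp: (U, U) => U
  ): Map[K, U] = {
    val perPartition: Seq[Map[K, U]] = partitions.map { part =>
      part.foldLeft(Map.empty[K, U]) { case (acc, (k, v)) =>
        acc.updated(k, seqOp(acc.getOrElse(k, zeroValue), v))
      }
    }
    perPartition.foldLeft(Map.empty[K, U]) { (merged, m) =>
      m.foldLeft(merged) { case (acc, (k, u)) =>
        acc.updated(k, acc.get(k).map(combOp(_, u)).getOrElse(u))
      }
    }
  }

  // Made-up rows: ((id, name), (amount, code)), split into two "partitions".
  val partitions: Seq[Seq[((String, String), (Int, Int))]] = Seq(
    Seq((("1", "Siddhesh"), (18150, 43)), (("2", "Devil"), (2000, 10))),
    Seq((("1", "Siddhesh"), (18150, 12)), (("2", "Devil"), (2000, 10)),
        (("2", "Devil"), (2000, 11)))
  )

  // zeroValue is (Int, List[Int]), so U = (Int, List[Int]) for both functions.
  val fullCounts: Map[(String, String), (Int, List[Int])] =
    aggregateByKeyLocal(partitions, (0, List.empty[Int]))(
      (acc, v) => (acc._1 + v._1, v._2 :: acc._2),
      (a, b) => (a._1 + b._1, b._2 ::: a._2)
    )
}
```

With a zeroValue of (0, 0), U would be (Int, Int), which is why the compiler reports "found: Set[Int], required: Int" for the lambdas in the question.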

Note that the Set to List change would apply to your combineByKey code as well if you want the full count instead of the distinct count.


For a distinct count, per your comment, simply stay with Set, with zeroValue set to (0, Set.empty[Int]):

val reduced = rdd.aggregateByKey((0, Set.empty[Int]))(
  (x: (Int, Set[Int]), y: (Int, Int)) => (x._1 + y._1, x._2 + y._2),
  (x: (Int, Set[Int]), y: (Int, Set[Int])) => (x._1 + y._1, x._2 ++ y._2)
)

reduced.collect
// res3: Array[((String, String), (Int, scala.collection.immutable.Set[Int]))] =
//   Array(((2,Devil),(6000,Set(10, 11))), ((1,Siddhesh),(36300,Set(43, 12))))
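
The second attempt in the question fails for a different reason: operand order. Set[Int] defines + for adding an element, but Int has no + that accepts a Set, so y._2 + x._2 does not compile while x._2 + y._2 does. The sketch below uses plain Scala collections with made-up values to show this. It also suggests, as an assumption I have not tested against the original data, that the Long column could be kept by writing the zero as 0L instead of converting it to Int:

```scala
object SetOperandOrder {
  val seen: Set[Int] = Set(10)

  // Set[Int] + Int adds a single element; duplicates are ignored.
  val added: Set[Int] = seen + 11
  val unchanged: Set[Int] = seen + 10

  // Set[Int] ++ Set[Int] is a union, which is what the merge step needs.
  val merged: Set[Int] = seen ++ Set(10, 11)

  // val broken = 11 + seen   // does not compile: Int.+ has no Set[Int] overload

  // Same per-value step as the seqOp above, but with a Long sum (zero is 0L).
  // Made-up (amount, code) rows for a single key.
  val rows: Seq[(Long, Int)] = Seq((2000L, 10), (2000L, 10), (2000L, 11))
  val result: (Long, Set[Int]) =
    rows.foldLeft((0L, Set.empty[Int])) { (acc, v) =>
      (acc._1 + v._1, acc._2 + v._2)
    }
}
```

Here result._2.size gives the distinct count for the key, matching the Set-based output above.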
