
aggregateByKey method not working in spark rdd

Below is my sample data:

1,Siddhesh,43,32000
1,Siddhesh,12,4300
2,Devil,10,1000
2,Devil,10,3000
2,Devil,11,2000

I created a pair RDD to perform combineByKey and aggregateByKey operations. Below is my code:

val rd=sc.textFile("file:///home/cloudera/Desktop/details.txt").map(line=>line.split(",")).map(p=>((p(0).toString,p(1).toString),(p(3).toLong,p(2).toString.toInt)))  

Above, I paired the first two columns as the key and the remaining two columns as the value. Now I want only the distinct values of the 3rd column (the right element of the value tuple), which I was able to do with combineByKey. Below is my code:

val reduced = rd.combineByKey(
      (x:(Long,Int))=>{(x._1,Set(x._2))},
      (x:(Long,Set[Int]),y:(Long,Int))=>(x._1+y._1,x._2+y._2),
      (x:(Long,Set[Int]),y:(Long,Set[Int]))=>{(x._1+y._1,x._2++y._2)}
      )  
scala> reduced.foreach(println)
((1,Siddhesh),(36300,Set(43, 12)))
((2,Devil),(6000,Set(10, 11)))

Now I map it so that I get the sum of the values along with the number of distinct values for each key.

scala> val newRdd=reduced.map(p=>(p._1._1,p._1._2,p._2._1,p._2._2.size))

scala> newRdd.foreach(println)
(1,Siddhesh,36300,2)
(2,Devil,6000,2)

Here, the last value for Devil is 2: the dataset has the value 10 twice for the 'Devil' records, and since I used a Set the duplicates are eliminated. So now I tried the same with aggregateByKey. Below is my code, which fails with an error:

val rd=sc.textFile("file:///home/cloudera/Desktop/details.txt").map(line=>line.split(",")).map(p=>((p(0).toString,p(1).toString),(p(3).toString.toInt,p(2).toString.toInt)))    

I converted the value column from Long to Int because the '0' in the zero value was throwing a type error during initialization.

scala> val reducedByAggKey = rd.aggregateByKey((0,0))(
     |        (x:(Int,Set[Int]),y:(Int,Int))=>(x._1+y._1,x._2+y._2),
     |       (x:(Int,Set[Int]),y:(Int,Set[Int]))=>{(x._1+y._1,x._2++y._2)}
     | )
<console>:36: error: type mismatch;
 found   : scala.collection.immutable.Set[Int]
 required: Int
              (x:(Int,Set[Int]),y:(Int,Int))=>(x._1+y._1,x._2+y._2),
                                                             ^
<console>:37: error: type mismatch;
 found   : scala.collection.immutable.Set[Int]
 required: Int
             (x:(Int,Set[Int]),y:(Int,Set[Int]))=>{(x._1+y._1,x._2++y._2)}
                                                                  ^  

And as suggested by Leo, below is my code, which still fails with an error:

scala> val reduced = rdd.aggregateByKey((0, Set.empty[Int]))(
     |   (x: (Int, Set[Int]), y: (Int, Int)) => (x._1 + y._1, y._2+x._2),
     |   (x: (Int, Set[Int]), y: (Int, Set[Int])) => (x._1 + y._1, y._2++ x._2)
     | )
<console>:36: error: overloaded method value + with alternatives:
  (x: Double)Double <and>
  (x: Float)Float <and>
  (x: Long)Long <and>
  (x: Int)Int <and>
  (x: Char)Int <and>
  (x: Short)Int <and>
  (x: Byte)Int <and>
  (x: String)String
 cannot be applied to (Set[Int])
         (x: (Int, Set[Int]), y: (Int, Int)) => (x._1 + y._1, y._2+x._2),
                                                                  ^

So where am I making a mess here? Please correct me.

If I understand your requirement correctly, to get the full count rather than the distinct count, use List instead of Set for the aggregations. As for the problem with your aggregateByKey, it's due to the incorrect type of the zeroValue, which should be (0, List.empty[Int]) (it would have been (0, Set.empty[Int]) had you stuck with Set):

val reduced = rdd.aggregateByKey((0, List.empty[Int]))(
  (x: (Int, List[Int]), y: (Int, Int)) => (x._1 + y._1, y._2 :: x._2),
  (x: (Int, List[Int]), y: (Int, List[Int])) => (x._1 + y._1, y._2 ::: x._2)
)

reduced.collect
// res1: Array[((String, String), (Int, List[Int]))] =
//   Array(((2,Devil),(6000,List(11, 10, 10))), ((1,Siddhesh),(36300,List(12, 43))))

val newRdd = reduced.map(p => (p._1._1, p._1._2, p._2._1, p._2._2.size))

newRdd.collect
// res2: Array[(String, String, Int, Int)] =
//   Array((2,Devil,6000,3), (1,Siddhesh,36300,2))

Note that the Set-to-List change would apply to your combineByKey code as well if you want the full count instead of the distinct count; a sketch follows below.
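
For reference, here is a minimal sketch (not part of the original answer) of that change applied to combineByKey, assuming the same rdd with (Int, Int) values used in the aggregateByKey examples above; because List keeps duplicates, its size gives the full count:

val combinedFull = rdd.combineByKey(
  // create the combiner from the first value seen for a key
  (v: (Int, Int)) => (v._1, List(v._2)),
  // merge a value into an existing combiner
  (acc: (Int, List[Int]), v: (Int, Int)) => (acc._1 + v._1, v._2 :: acc._2),
  // merge combiners across partitions
  (a: (Int, List[Int]), b: (Int, List[Int])) => (a._1 + b._1, a._2 ::: b._2)
)

combinedFull.collect
// expected to match the aggregateByKey result above, e.g.
// Array(((2,Devil),(6000,List(11, 10, 10))), ((1,Siddhesh),(36300,List(43, 12))))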

[UPDATE]

For the distinct count per your comment, simply stay with Set, with the zeroValue set to (0, Set.empty[Int]):

val reduced = rdd.aggregateByKey((0, Set.empty[Int]))(
  (x: (Int, Set[Int]), y: (Int, Int)) => (x._1 + y._1, x._2 + y._2),
  (x: (Int, Set[Int]), y: (Int, Set[Int])) => (x._1 + y._1, x._2 ++ y._2)
)

reduced.collect
// res3: Array[((String, String), (Int, scala.collection.immutable.Set[Int]))] =
//   Array(((2,Devil),(6000,Set(10, 11))), ((1,Siddhesh),(36300,Set(43, 12))))
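
Applying the same map step as before to this Set-based result should then give the per-key sum together with the distinct count (a usage sketch, matching the output you already got from combineByKey):

val newRdd = reduced.map(p => (p._1._1, p._1._2, p._2._1, p._2._2.size))

newRdd.collect
// Array((2,Devil,6000,2), (1,Siddhesh,36300,2))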
