
Type mismatch in Scala while using reduceByKey

I have separately tested my error code in the Scala shell:

scala> val p6 = sc.parallelize(List( ("a","b"),("b","c")))
p6: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[10] at parallelize at <console>:24

scala> val p7 = p6.map(a => ((a._1+a._2), (a._1, a._2, 1)))
p7: org.apache.spark.rdd.RDD[(String, (String, String, Int))] = MapPartitionsRDD[11] at map at <console>:26

scala> val p8 = p7.reduceByKey( (a,b) => (a._1,(a._2, a._3+b._3)))
<console>:28: error: type mismatch;
 found   : (String, (String, Int))
 required: (String, String, Int)
       val p8 = p7.reduceByKey( (a,b) => (a._1,(a._2, a._3+b._3)))

I want to use a._1 as the key so that I can later use the join operator, which requires (key, value) pairs. But my question is: why is there a required type at all when I am using a reduce function? I thought the format was chosen by us rather than being fixed. Am I wrong?

Also, if I am wrong, then why is (String, String, Int) required? Why is it not something else?

PS: I know (String, String, Int) is the value type produced by the map function ((a._1+a._2), (a._1, a._2, 1)), but the official example shows that the reduce function (a, b) => (a._1 + b._1, a._2 + b._2) is valid. So I think all of these, including my code above, should be valid.

Take a look at the types. reduceByKey is a method on RDD[(K, V)] with the signature:

def reduceByKey(func: (V, V) ⇒ V): RDD[(K, V)]

In other words, both input arguments and the return value have to be of the same type.

In your case p7 is

RDD[(String, (String, String, Int))]

where K is String and V is (String, String, Int), so the function used with reduceByKey must be

((String, String, Int), (String, String, Int)) => (String, String, Int)

A valid function would be:

p7.reduceByKey( (a,b) => (a._1, a._2, a._3 + b._3))

which would give you

(bc,(b,c,1))
(ab,(a,b,1))

as a result.
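This also answers the PS in the question: the official reduce function (a, b) => (a._1 + b._1, a._2 + b._2) is valid because in that example the value type is itself a pair of the same type on both sides. A minimal sketch (the sample data below is made up for illustration):

// Here V is (Int, Int): the function's inputs and its output are all (Int, Int),
// so the (V, V) => V constraint of reduceByKey is satisfied.
val pairs = sc.parallelize(List(("a", (1, 1)), ("a", (2, 1)), ("b", (3, 1))))
val summed = pairs.reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2))
// summed: RDD[(String, (Int, Int))], e.g. ("a", (3, 2)) and ("b", (3, 1))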

If you want to change the value type in a *ByKey method, you have to use aggregateByKey or combineByKey.
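A minimal sketch with aggregateByKey (the name p9 and the target value type (String, Int) are assumptions for illustration): its accumulator type U is independent of the input value type V, so it can turn the (String, String, Int) values into (String, Int) pairs:

// zeroValue is the initial accumulator; seqOp folds one input value into the
// accumulator within a partition; combOp merges accumulators across partitions.
val p9 = p7.aggregateByKey(("", 0))(
  (acc, v) => (v._2, acc._2 + v._3),                         // (U, V) => U
  (x, y) => (if (x._1.nonEmpty) x._1 else y._1, x._2 + y._2) // (U, U) => U
)
// p9: org.apache.spark.rdd.RDD[(String, (String, Int))]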

Your p7 is of type org.apache.spark.rdd.RDD[(String, (String, String, Int))], but in your reduceByKey you have used (a._1, (a._2, a._3+b._3)), which is of type (String, (String, Int)).

The output type of p8 should also be org.apache.spark.rdd.RDD[(String, (String, String, Int))].

So defining it like the following should work for you:

val p8 = p7.reduceByKey( (a,b) => (a._1, a._2, a._3+b._3))
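If what you were actually after is a value of type (String, Int) per key, which the inner part of your lambda suggests, one alternative sketch (variable names here are hypothetical) is to reshape the values with mapValues first, so that reduceByKey again sees matching input and output types:

// Reshape V from (String, String, Int) to (String, Int) up front, so the
// reduce function satisfies (V, V) => V for the new V.
val reshaped = p7.mapValues { case (first, second, count) => (second, count) }
// reshaped: RDD[(String, (String, Int))]
val reduced = reshaped.reduceByKey((a, b) => (a._1, a._2 + b._2))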

You can read my answer in pyspark for more detail on how reduceByKey works,

and this one should help too.
