Type mismatch in Scala while using reduceByKey
I have separately tested the failing code in the Scala shell:
scala> val p6 = sc.parallelize(List( ("a","b"),("b","c")))
p6: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[10] at parallelize at <console>:24
scala> val p7 = p6.map(a => ((a._1+a._2), (a._1, a._2, 1)))
p7: org.apache.spark.rdd.RDD[(String, (String, String, Int))] = MapPartitionsRDD[11] at map at <console>:26
scala> val p8 = p7.reduceByKey( (a,b) => (a._1,(a._2, a._3+b._3)))
<console>:28: error: type mismatch;
found : (String, (String, Int))
required: (String, String, Int)
val p8 = p7.reduceByKey( (a,b) => (a._1,(a._2, a._3+b._3)))
I want to use a._1 as the key so that I can later use the join operator, which requires (key, value) pairs.
But my question is: why is there a required type when I am using a reducing function? I thought the format was set by ourselves rather than being fixed. Am I wrong?
Also, if I am wrong, why is (String, String, Int) the required type? Why not something else?
PS: I know (String, String, Int) is the value type in ((a._1+a._2), (a._1, a._2, 1)), which is the map function, but the official example shows that the reduce function (a, b) => (a._1 + b._1, a._2 + b._2) is valid, and I think all of these, including my code above, should be valid.
Take a look at the types. reduceByKey is a method on RDD[(K, V)] with the signature:
def reduceByKey(func: (V, V) ⇒ V): RDD[(K, V)]
In other words, both input arguments and the return value have to be of the same type.
In your case, p7 is
RDD[(String, (String, String, Int))]
where K is String and V is (String, String, Int), so the function used with reduceByKey must have the type
((String, String, Int), (String, String, Int)) => (String, String, Int)
A valid function would be:
p7.reduceByKey( (a,b) => (a._1, a._2, a._3 + b._3))
which would give you
(bc,(b,c,1))
(ab,(a,b,1))
as a result.
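To see why the input and output types have to match, here is a minimal sketch in plain Scala collections, with no Spark involved. The helper reduceByKeyLocal is a hypothetical stand-in written only for illustration, not Spark's implementation, and the sample data repeats a key on purpose so that the function is applied more than once.

```scala
// Hypothetical local stand-in for reduceByKey, for illustration only.
// func must be (V, V) => V because its result is fed back in as an input
// whenever a key has more than two values.
def reduceByKeyLocal[K, V](pairs: Seq[(K, V)])(func: (V, V) => V): Map[K, V] =
  pairs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(func) }

// Assumed sample data: the key "ab" appears three times.
val sample = Seq(
  ("ab", ("a", "b", 1)),
  ("ab", ("a", "b", 1)),
  ("ab", ("a", "b", 1)),
  ("bc", ("b", "c", 1))
)

// The function returns the same shape it receives: (String, String, Int).
val reduced = reduceByKeyLocal(sample)((a, b) => (a._1, a._2, a._3 + b._3))
```

If the function returned (String, (String, Int)) instead, the second application for "ab" would receive a value of the wrong shape as its first argument, which is what the compiler is rejecting.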
If you want to change the value type in a byKey method, you have to use aggregateByKey or combineByKey.
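As a rough sketch of the aggregateByKey route, the snippet below reduces each (String, String, Int) value to a plain Int count, so the value type changes from V = (String, String, Int) to U = Int. It assumes it is pasted into a spark-shell session where sc is already defined; the name counts is made up for this example.

```scala
// Assumes a spark-shell session, where `sc` is provided by the shell.
val p7 = sc.parallelize(List(("a", "b"), ("b", "c")))
  .map(a => (a._1 + a._2, (a._1, a._2, 1)))

// zeroValue fixes the result type U = Int.
// seqOp  : (U, V) => U folds one value into the running count.
// combOp : (U, U) => U merges partial counts from different partitions.
val counts = p7.aggregateByKey(0)(
  (acc, v) => acc + v._3,
  (x, y) => x + y
)
// counts: org.apache.spark.rdd.RDD[(String, Int)]
```

Unlike reduceByKey, the sequence function here takes two different types, which is why the result type is free to differ from the input value type.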
Your p7 has the type org.apache.spark.rdd.RDD[(String, (String, String, Int))], but in your reduceByKey you have returned (a._1,(a._2, a._3+b._3)), which is of type (String, (String, Int)).
The output type of p8 should also be org.apache.spark.rdd.RDD[(String, (String, String, Int))],
so defining it like the following should work for you:
val p8 = p7.reduceByKey( (a,b) => (a._1, a._2, a._3+b._3))
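If the goal is still to end up with a._1 as the key for a later join, one possible approach is to keep the reduce function type-preserving and reshape the pairs afterwards with map. This is only a sketch under the assumption that it runs in a spark-shell session where sc exists; byFirst is a name invented for this example.

```scala
// Assumes a spark-shell session, where `sc` is provided by the shell.
val p7 = sc.parallelize(List(("a", "b"), ("b", "c")))
  .map(a => (a._1 + a._2, (a._1, a._2, 1)))

// Step 1: reduce without changing the value type.
val p8 = p7.reduceByKey((a, b) => (a._1, a._2, a._3 + b._3))

// Step 2: re-key by the first component; map may change the element type freely.
val byFirst = p8.map { case (_, (first, second, n)) => (first, (second, n)) }
// byFirst: org.apache.spark.rdd.RDD[(String, (String, Int))]
```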
You can read my PySpark answer for more detail on how reduceByKey works.