Spark Scala type mismatch with zipWithIndex in groupByKey
I am trying to use groupByKey to find the nth highest score of a subject.
My data looks like this:
scala> a
res176: org.apache.spark.rdd.RDD[(String, String)] = MapPartitionsRDD[263] at map at <console>:51
scala> a.take(10).foreach{println}
(data science,DN,US,28,98,SMITH,data science)
(maths,DN,US,28,92,SMITH,maths)
(chemistry,DN,US,28,94,SMITH,chemistry)
(physics,DN,US,28,88,SMITH,physics)
(data science,DN,UK,25,93,JOHN,data science)
(maths,DN,UK,25,91,JOHN,maths)
(chemistry,DN,UK,25,95,JOHN,chemistry)
(physics,DN,UK,25,90,JOHN,physics)
(data science,DN,CA,29,67,MARK,data science)
(maths,DN,CA,29,68,MARK,maths)
scala>
So for the first row, the string "data science" is the key and the string "DN,US,28,98,SMITH,data science" is the value.
Now I want to find the 2nd highest score using groupByKey:
scala> a.groupByKey().flatMap(rec=>{ val max = rec._2.toList.map(x=>x.split(',')(3).toFloat).distinct.sortBy(x=>(-x)).zipWithIndex.filter(x=>x._2==2).toMap.keys
| rec._2.toList.filter{x=>x.split(',')(3).toFloat==max}
| }).take(15).foreach{println}
scala>
I am getting nothing here.
If I run this with a hard-coded value, I get results:
scala> a.groupByKey().flatMap(rec=>{ val max = "98"
| rec._2.toList.sortBy(x=>(-x.split(',')(3).toFloat)).takeWhile(rec=> max.contains(rec.split(',')(3)))}).take(15).foreach{println}
DN,IND,26,98,XMAN,maths
DPS,US,28,98,XOMAN,chemistry
DN,US,28,98,SMITH,data science
This also gives me values:
scala> a.groupByKey().flatMap(rec=>{ rec._2.toList.map(x=>x.split(',')(3).toFloat).distinct.sortBy(x=>(-x)).zipWithIndex.filter(x=>x._2==2).map(_._1)}).take(15).foreach{println}
94.0
92.0
95.0
93.0
Some more complex code also gives me output:
scala> a.groupByKey().flatMap(rec=>{ val max = rec._2.toList.map(x=>x.split(',')(3).toFloat).distinct.sortBy(x=>(-x)).take(1)
| rec._2.toList.sortBy(x=>(-x.split(',')(3).toFloat)).takeWhile(rec=> max.contains(rec.split(',')(3).toFloat))}).take(15).foreach{println}
DN,IND,26,98,XMAN,maths
DPS,UK,25,96,SOMK,physics
DPS,US,28,98,XOMAN,chemistry
DN,US,28,98,SMITH,data science
It looks like there is some data type mismatch when I use zipWithIndex. Can someone help me here?
There is a type mismatch due to .toMap.keys. As a result, val max is of type Iterable[Float], because the method keys returns Iterable[A]. Comparing a Float with an Iterable[Float] using == is always false, so the filter keeps nothing.
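The mismatch can be reproduced without Spark, since the function passed to flatMap only uses plain Scala collections. The following is a minimal sketch, assuming a made-up list of scores; the names KeysTypeDemo, scores and looseEquals are hypothetical and only serve the illustration.

```scala
object KeysTypeDemo {
  // Hypothetical scores for one subject group (not the question's full data).
  val scores: List[Float] = List(98f, 93f, 67f, 93f)

  // Same pipeline as in the question. The type ascription compiles,
  // which confirms that the result is an Iterable[Float], not a Float.
  val max: Iterable[Float] =
    scores.distinct.sortBy(x => -x).zipWithIndex.filter(x => x._2 == 2).toMap.keys

  // Arguments are typed as Any so that this compiles on any Scala version;
  // at runtime it performs the same Float-versus-collection comparison
  // as the filter in the question.
  def looseEquals(a: Any, b: Any): Boolean = a == b

  def main(args: Array[String]): Unit = {
    // A Float never equals a collection, so nothing survives the filter.
    println(scores.filter(s => looseEquals(s, max)))  // List()
    // Comparing Float with Float behaves as intended.
    println(scores.filter(s => s == max.head))        // List(67.0)
  }
}
```

In Scala 2 the direct comparison from the question compiles, but the compiler warns that comparing these two types will always yield false, which is a useful hint to look for when a filter silently returns nothing.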
One solution is to add head at the end of the max calculation:
val max = rec._2.toList
.map(x => x.split(',')(3).toFloat)
.distinct
.sortBy(x => (-x))
.zipWithIndex
.filter(x => x._2 == 2)
.toMap
.keys
.head
Basically, head returns a value of type Float. The comparison x.split(',')(3).toFloat == max then at least compares values of the same type.
However, calling head is not safe. It throws an exception if, in your case, the filter function returns an empty list, for example when a group has fewer than three distinct scores. The exception looks like this:
java.util.NoSuchElementException: next on empty iterator
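One way to avoid the exception is to look the score up with lift, which returns an Option instead of throwing. This is a sketch rather than the only fix: the object NthHighest and the helper rowsAtRank are hypothetical names, and the sample rows are taken from the question's output. Note that zipWithIndex is zero-based, so index 2 in the question selects the third-highest score; rank 1 below is the second highest.

```scala
object NthHighest {
  // Hypothetical helper: returns the rows whose score (4th comma-separated
  // field) equals the score at the given zero-based rank among the
  // distinct scores, sorted in descending order.
  def rowsAtRank(rows: Iterable[String], rank: Int): List[String] = {
    val list = rows.toList
    def score(row: String): Float = row.split(',')(3).toFloat

    // lift(rank) yields None when the group has too few distinct scores,
    // where head on an empty collection would throw.
    val target: Option[Float] = list.map(score).distinct.sortBy(x => -x).lift(rank)

    target match {
      case Some(t) => list.filter(row => score(row) == t)
      case None    => Nil
    }
  }

  def main(args: Array[String]): Unit = {
    val rows = List(
      "DN,US,28,98,SMITH,data science",
      "DN,UK,25,93,JOHN,data science",
      "DN,CA,29,67,MARK,data science"
    )
    rowsAtRank(rows, 1).foreach(println) // the row with score 93
    println(rowsAtRank(rows, 5))         // List()
  }
}
```

With the RDD from the question, the helper would presumably be used as a.groupByKey().flatMap(rec => NthHighest.rowsAtRank(rec._2, 1)).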
Once it works for a concrete data sample, you could refactor this code to work with a Set. Instead of head, use .keys.toSet and compare as you did in the other examples, using max.contains(rec.split(',')(3).toFloat). The .toFloat is needed because the set holds Float values, not strings.
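The Set variant could look like the sketch below. It keeps the zipWithIndex pipeline from the question and only changes the last step; the object SetVariant and the helper rowsWithScoreIn are hypothetical names, and the rank parameter is an assumption added so that the index is not hard-coded.

```scala
object SetVariant {
  // Hypothetical helper: same pipeline as in the question, but the keys are
  // collected into a Set[Float]. An empty Set simply matches no rows,
  // so no exception can be thrown.
  def rowsWithScoreIn(rows: Iterable[String], rank: Int): List[String] = {
    val list = rows.toList
    val max: Set[Float] = list
      .map(x => x.split(',')(3).toFloat)
      .distinct
      .sortBy(x => -x)
      .zipWithIndex
      .filter(x => x._2 == rank) // zero-based: 0 is the highest score
      .toMap
      .keys
      .toSet

    // Both sides are Float here, so contains compares like with like.
    list.filter(x => max.contains(x.split(',')(3).toFloat))
  }

  def main(args: Array[String]): Unit = {
    val rows = List(
      "DN,US,28,98,SMITH,data science",
      "DN,UK,25,93,JOHN,data science",
      "DN,CA,29,67,MARK,data science"
    )
    rowsWithScoreIn(rows, 0).foreach(println) // the row with score 98
  }
}
```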