Spark Scala type mismatch with zipWithIndex in groupByKey

Question

I am trying to use groupByKey to find the nth highest score for a subject.

My data looks like this:

scala> a
res176: org.apache.spark.rdd.RDD[(String, String)] = MapPartitionsRDD[263] at map at <console>:51

scala> a.take(10).foreach{println}
(data science,DN,US,28,98,SMITH,data science)
(maths,DN,US,28,92,SMITH,maths)
(chemistry,DN,US,28,94,SMITH,chemistry)
(physics,DN,US,28,88,SMITH,physics)
(data science,DN,UK,25,93,JOHN,data science)
(maths,DN,UK,25,91,JOHN,maths)
(chemistry,DN,UK,25,95,JOHN,chemistry)
(physics,DN,UK,25,90,JOHN,physics)
(data science,DN,CA,29,67,MARK,data science)
(maths,DN,CA,29,68,MARK,maths)

scala>

So for the first row, the string "data science" is the key and the string "DN,US,28,98,SMITH,data science" is the value.

Now I want to find the 2nd highest score using groupByKey:

scala> a.groupByKey().flatMap(rec=>{ val max = rec._2.toList.map(x=>x.split(',')(3).toFloat).distinct.sortBy(x=>(-x)).zipWithIndex.filter(x=>x._2==2).toMap.keys
     | rec._2.toList.filter{x=>x.split(',')(3).toFloat==max}
     | }).take(15).foreach{println}

scala>

I get no output here.

If I run it with a hard-coded value, I get results:

scala> a.groupByKey().flatMap(rec=>{ val max = "98"
     | rec._2.toList.sortBy(x=>(-x.split(',')(3).toFloat)).takeWhile(rec=> max.contains(rec.split(',')(3)))}).take(15).foreach{println}
DN,IND,26,98,XMAN,maths
DPS,US,28,98,XOMAN,chemistry
DN,US,28,98,SMITH,data science

This also gives me values:

scala> a.groupByKey().flatMap(rec=>{ rec._2.toList.map(x=>x.split(',')(3).toFloat).distinct.sortBy(x=>(-x)).zipWithIndex.filter(x=>x._2==2).map(_._1)}).take(15).foreach{println}
94.0
92.0
95.0
93.0

Some more complex code also gives me output:

scala> a.groupByKey().flatMap(rec=>{ val max = rec._2.toList.map(x=>x.split(',')(3).toFloat).distinct.sortBy(x=>(-x)).take(1)
     | rec._2.toList.sortBy(x=>(-x.split(',')(3).toFloat)).takeWhile(rec=> max.contains(rec.split(',')(3).toFloat))}).take(15).foreach{println}
DN,IND,26,98,XMAN,maths
DPS,UK,25,96,SOMK,physics
DPS,US,28,98,XOMAN,chemistry
DN,US,28,98,SMITH,data science

It looks like there is a data type mismatch when I use zipWithIndex. Can someone help me here?

Answer 1

There is a type mismatch caused by .toMap.keys. As a result, val max is of type Iterable[Float], because the keys method returns Iterable[A]. Comparing a Float with an Iterable[Float] using == is always false, so the filter drops every record and nothing is printed.

One solution is to add head at the end of the max calculation:
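
The mismatch can be reproduced without Spark. Below is a minimal sketch on a plain Scala List; the scores are made up for illustration and are not the asker's data. The element is widened to Any only so that the sketch still compiles in case the compiler rejects a direct Float == Iterable[Float] comparison.

```scala
// Hypothetical scores for one subject (not taken from the question's data).
val scores: List[Float] = List(98f, 93f, 67f, 93f)

// Same chain as in the question: the result is a collection, not a single Float.
val max: Iterable[Float] = scores.distinct
  .sortBy(x => -x)
  .zipWithIndex
  .filter(x => x._2 == 2)
  .toMap
  .keys

// A Float is never equal to an Iterable[Float], so nothing passes the filter.
// (x is widened to Any; the question compares x == max directly.)
val matches: List[Float] = scores.filter(x => (x: Any) == max)
```

Here max holds a single element, yet matches is still empty, which mirrors the empty output in the question.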

  val max = rec._2.toList
    .map(x => x.split(',')(3).toFloat)
    .distinct
    .sortBy(x => (-x))
    .zipWithIndex
    .filter(x => x._2 == 2)
    .toMap
    .keys
    .head

Basically, head returns a value of type Float. The comparison x.split(',')(3).toFloat == max then at least compares values of the same type.

However, calling head is not safe. If, in your case, the filter can return an empty list, head throws an exception like this:

java.util.NoSuchElementException: next on empty iterator
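
One way to avoid that exception is to keep the result as an Option instead of calling head. The sketch below works on plain Scala lists; nthHighest is a hypothetical helper name, and n is zero-based to match zipWithIndex.

```scala
// Hypothetical helper: n-th highest distinct score, zero-based (0 = highest).
// Returns None instead of throwing when there are fewer than n + 1 distinct scores.
def nthHighest(scores: List[Float], n: Int): Option[Float] =
  scores.distinct.sortBy(x => -x).lift(n)

// Enough distinct scores: the second highest is found.
val second: Option[Float] = nthHighest(List(98f, 93f, 67f), 1)

// Only one distinct score: no second highest, and no exception either.
val missing: Option[Float] = nthHighest(List(98f), 1)
```

The caller can then decide what an absent value means, for example by returning no records for that group.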

Once it works for a concrete data sample, you can consider refactoring this code to work with a Set. Instead of head, call .keys.toSet and compare as you did in the other examples, using max.contains(rec.split(',')(3).toFloat). The toFloat is needed because max is then a Set[Float].
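
A Set-based variant might look like the sketch below. It runs on plain Scala collections, with groupBy standing in for groupByKey; the sample rows and the names score and recordsWithNthScore are assumptions for illustration. It uses .map(_._1).toSet in place of .toMap.keys.toSet, which yields the same set. Since zipWithIndex is zero-based, index 1 selects the second highest score.

```scala
// Hypothetical (key, value) pairs in the same shape as the question's data.
val rows: List[(String, String)] = List(
  ("maths", "DN,US,28,92,SMITH,maths"),
  ("maths", "DN,UK,25,91,JOHN,maths"),
  ("maths", "DN,CA,29,68,MARK,maths"),
  ("physics", "DN,US,28,88,SMITH,physics"),
  ("physics", "DN,UK,25,90,JOHN,physics")
)

// The score is the fourth comma-separated field of the value.
def score(value: String): Float = value.split(',')(3).toFloat

// Keep the records of one group whose score is the n-th highest (zero-based).
def recordsWithNthScore(values: List[String], n: Int): List[String] = {
  val wanted: Set[Float] = values.map(score).distinct
    .sortBy(x => -x)
    .zipWithIndex
    .filter(x => x._2 == n)
    .map(_._1)
    .toSet // empty when the group has fewer than n + 1 distinct scores
  values.filter(v => wanted.contains(score(v)))
}

// groupBy plays the role of groupByKey here.
val secondHighest: List[String] = rows
  .groupBy(_._1)
  .toList
  .flatMap { case (_, group) => recordsWithNthScore(group.map(_._2), 1) }
```

With the RDD from the question, the same helper could presumably be applied as a.groupByKey().flatMap(rec => recordsWithNthScore(rec._2.toList, 1)); that call has not been run against the original data.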

Answer 1: accepted, 2019-02-02 18:51:12