
Getting the largest value in a sorted RDD using Scala/Spark

I have logs like this:

ERROR: Error fetching remote repo 'origin'
...
ERROR: SVN Problem
..
ERROR: Error fetching remote repo 'origin'
ERROR: Error fetching remote repo 'origin'

I wrote the code below to sort the errors by their number of occurrences:

val getErrorLines = lines.filter(value => value.startsWith("ERROR"))
val mappedErrors = getErrorLines.map { s =>
  val substrings = s.split(":")
  (substrings(1), substrings(0))   // (message, level)
}.map(value => (value, 1)).reduceByKey(_ + _).sortBy(_._2, false, 1)

This gives me the list of errors sorted by number of occurrences:

(( Error fetching remote repo 'origin',ERROR),5)
(( SVN Problem,ERROR),1)

But I want only the most frequent error, which is:

(( Error fetching remote repo 'origin',ERROR),5)

I chained the top() function onto the sort, but it still gave me:

(( SVN Problem,ERROR),1)

Is there another function that would give me the entry with the largest number of occurrences, based on the value?

The simplest solution is to swap each pair so that the count comes first, then take the top element:

val substrings = sc.parallelize(Seq(
  (("Error fetching remote repo 'origin'", "ERROR"), 5),
  (("SVN Problem", "ERROR"), 1)
))

substrings.map(_.swap).top(1)
// Array[(Int, (String, String))] = Array((5,(Error fetching remote repo 'origin',ERROR)))
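
To see why the swap matters, here is a minimal sketch that uses a plain Scala Seq as a stand-in for the RDD (an assumption made so that it runs without Spark). top relies on the same implicit tuple Ordering that Seq.max uses, and that ordering compares the first tuple element first. The names errorPairs, largestByKey and largestByCount are hypothetical.

```scala
// A plain Seq stands in for the RDD; the leading spaces mirror the output of split(":").
val errorPairs: Seq[((String, String), Int)] = Seq(
  ((" Error fetching remote repo 'origin'", "ERROR"), 5),
  ((" SVN Problem", "ERROR"), 1)
)

// Default tuple ordering compares the (message, level) key first.
// " SVN Problem" sorts after " Error ..." alphabetically, so it wins
// and the count is never consulted.
val largestByKey = errorPairs.max

// After swapping, the count is the first element and decides the comparison.
val largestByCount = errorPairs.map(_.swap).max
```

This is most likely why top() returned the SVN Problem entry in the question: the largest key won, not the largest count.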

You can use the max method on the RDD.

If the default result does not suit your use case, you can pass your own Ordering. In your case, since the highest entry is the one with the largest integer in the second part of the tuple, I think this will work fine:

rdd.max()(Ordering[Int].on(x=>x._2))
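
The same idea can be written with a named Ordering, which some may find easier to read and reuse. This is a sketch on a plain Scala Seq rather than an RDD (an assumption, so it runs without Spark); with an RDD of the same element type, the ordering could be passed as rdd.max()(orderByCount). Entry, sampleCounts, orderByCount and mostFrequent are hypothetical names.

```scala
// Element type produced by the counting step: ((message, level), count).
type Entry = ((String, String), Int)

val sampleCounts: Seq[Entry] = Seq(
  (("Error fetching remote repo 'origin'", "ERROR"), 5),
  (("SVN Problem", "ERROR"), 1)
)

// Compare entries only by their count (the second tuple element).
val orderByCount: Ordering[Entry] = Ordering.by[Entry, Int](_._2)

// Pass the ordering explicitly instead of relying on the default tuple ordering.
val mostFrequent: Entry = sampleCounts.max(orderByCount)
```

Spelling out the type parameters on Ordering.by avoids depending on type inference for the lambda.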


I would use first:

val mostFrequentError = getErrorLines.map { s =>
    val substrings = s.split(":")
    (substrings(1), substrings(0))
  }
  .map(value => (value, 1))
  .reduceByKey(_ + _)
  .sortBy(_._2, false, 1)
  .first()
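
For reference, the whole pipeline can be tried out without a cluster. The sketch below uses plain Scala collections as a stand-in for the RDD (groupBy plus size plays the role of reduceByKey), and the log lines, including the INFO line, are made-up sample data. Because only the top entry is needed, it picks the maximum by count directly rather than sorting first. It also assumes that trimming the message and splitting with a limit of 2 are acceptable; neither is in the original code.

```scala
// Made-up sample input in the same shape as the logs in the question.
val logLines = Seq(
  "ERROR: Error fetching remote repo 'origin'",
  "INFO: build started",
  "ERROR: SVN Problem",
  "ERROR: Error fetching remote repo 'origin'"
)

val errorTotals: Map[(String, String), Int] = logLines
  .filter(_.startsWith("ERROR"))
  .map { line =>
    // Limit of 2 keeps any later colons inside the message part.
    val parts = line.split(":", 2)
    (parts(1).trim, parts(0))          // (message, level)
  }
  .groupBy(identity)                   // stand-in for map(v => (v, 1)).reduceByKey(_ + _)
  .map { case (key, hits) => (key, hits.size) }

// Single pass over the counts; no sort is needed to find the top entry.
val topError = errorTotals.maxBy(_._2)
```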


 