I have logs like this:
ERROR: Error fetching remote repo 'origin'
...
ERROR: SVN Problem
..
ERROR: Error fetching remote repo 'origin'
ERROR: Error fetching remote repo 'origin'
I wrote the function below to sort the errors by their number of occurrences:
val getErrorLines = lines.filter(value => value.startsWith("ERROR"))
val mappedErrors = getErrorLines
  .map(s => {
    val substrings = s.split(":")
    (substrings(1), substrings(0))
  })
  .map(value => (value, 1))
  .reduceByKey(_ + _)
  .sortBy(_._2, false, 1)
This gives me the list of errors sorted by number of occurrences:
(( Error fetching remote repo 'origin',ERROR),5)
(( SVN Problem,ERROR),1)
But I want only the most frequent error, which is:
(( Error fetching remote repo 'origin',ERROR),5)
I chained the top() function onto the sort, but it still gave me:
(( SVN Problem,ERROR),1)
Is there any other function that would give me the element with the largest count, based on the value?
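The simplest solution is to swap the key and the count. top uses the natural ordering of the tuple, which compares the first element first, so the count has to come first: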
val substrings = sc.parallelize(Seq(
(("Error fetching remote repo 'origin'", "ERROR"), 5),
(("SVN Problem", "ERROR"), 1)
))
substrings.map(_.swap).top(1)
// Array[(Int, (String, String))] = Array((5,(Error fetching remote repo 'origin',ERROR)))
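To illustrate why top returned the SVN entry before the swap: tuples are compared element by element from the left, so with ((message, level), count) the message strings decide the result, and " SVN Problem" sorts after " Error fetching remote repo 'origin'". The following is a minimal sketch using plain Scala collections rather than Spark; the object name TupleOrderingDemo and the sample data are illustrative assumptions, and Seq.max stands in for RDD.top(1) because both rely on the implicit Ordering of the element type.

```scala
object TupleOrderingDemo {
  // Same shape as the RDD elements in the question: ((message, level), count)
  val counts: Seq[((String, String), Int)] = Seq(
    ((" Error fetching remote repo 'origin'", "ERROR"), 5),
    ((" SVN Problem", "ERROR"), 1)
  )

  // Default tuple ordering compares the (message, level) key first,
  // so the lexicographically largest message wins, not the largest count.
  val maxByDefaultOrdering: ((String, String), Int) = counts.max

  // After swapping, the count is the first element and drives the comparison.
  val maxAfterSwap: (Int, (String, String)) = counts.map(_.swap).max

  def main(args: Array[String]): Unit = {
    println(maxByDefaultOrdering)
    println(maxAfterSwap)
  }
}
```

Here maxByDefaultOrdering is the SVN entry with count 1, while maxAfterSwap carries the count 5.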
You can use the max method on RDD.
If the default ordering is not suitable for your use case, you can pass your own Ordering. In your case the element you want is the one with the largest integer in the second part of the tuple, so this should work:
rdd.max()(Ordering[Int].on(x=>x._2))
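As a rough sketch of how such an ordering behaves, the snippet below builds an Ordering that looks only at the count and applies it to a plain Scala Seq instead of an RDD. The names CountOrderingDemo, ErrorCount and byCount and the sample data are illustrative assumptions; Ordering.by(_._2) is used here as an equivalent spelling of Ordering[Int].on(_._2).

```scala
object CountOrderingDemo {
  // Element shape from the question: ((message, level), count)
  type ErrorCount = ((String, String), Int)

  // Compares elements by their count only, ignoring the (message, level) key.
  val byCount: Ordering[ErrorCount] = Ordering.by[ErrorCount, Int](_._2)

  val counts: Seq[ErrorCount] = Seq(
    (("SVN Problem", "ERROR"), 1),
    (("Error fetching remote repo 'origin'", "ERROR"), 5)
  )

  // The ordering is passed explicitly instead of relying on the default tuple ordering.
  val mostFrequent: ErrorCount = counts.max(byCount)

  def main(args: Array[String]): Unit = println(mostFrequent)
}
```

On an RDD, the same Ordering value could presumably be passed in the second parameter list, as in rdd.max()(byCount) or rdd.top(1)(byCount), which avoids both the swap and the full sort.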
I would use first:
val mostFrequentError = getErrorLines
  .map(s => {
    val substrings = s.split(":")
    (substrings(1), substrings(0))
  })
  .map(value => (value, 1))
  .reduceByKey(_ + _)
  .sortBy(_._2, false, 1)
  .first()