Interval merge using Spark Scala

I have a list of intervals that I would like to merge whenever they overlap or are adjacent.

Example: for List((1,1),(2,2),(4,4),(5,5)), the desired output is List((1,2),(4,5)).

The input is a list of numbers about 2.5 GB in size, which I would like to transform into ranges.

Note: there are no duplicates in the input list.

Steps

  1. Input: List[Int].
  2. Map to a list of tuples: List((a,a),(b,b), ...).
  3. Reduce it with the range-merge logic.

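// Each number is wrapped as a one-element list holding the range (n, n).
val l = List(List((1,1)), List((2,2)), List((4,4)), List((5,5)))
val list = sc.parallelize(l)

// Merge two ranges if one ends right before the other starts; null means "not adjacent".
def merge(r1: (Int, Int), r2: (Int, Int)): (Int, Int) = {
  if (r1._2 + 1 == r2._1) (r1._1, r2._2)
  else if (r2._2 + 1 == r1._1) (r2._1, r1._2)
  else null
}

// Combine two partial results by trying to merge every pair (r1, r2).
val res = list.reduce((x, y) => {
  x.map(r1 => {
    y.map(r2 => {
      val m = merge(r1, r2)
      m match {
        case null => List(r1, r2)
        case _    => List(m)
      }
    }).flatten
  }).flatten
})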

res: List[(Int, Int)] = List((4,5), (2,2), (1,2))

The actual output is List((4,5), (2,2), (1,2)), whereas I expect List((4,5), (1,2)).
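
One way to see where a stray (2,2) can come from is to run the body of the reduce function locally on a single pair of partial results. The sketch below is plain Scala without Spark; `ReduceStepDemo` and `combine` are hypothetical names, and the two partial results are assumptions chosen for illustration, since the grouping Spark actually used is not known here.

```scala
object ReduceStepDemo {
  // Same adjacency rule as merge in the question; null means "not adjacent".
  def merge(r1: (Int, Int), r2: (Int, Int)): (Int, Int) =
    if (r1._2 + 1 == r2._1) (r1._1, r2._2)
    else if (r2._2 + 1 == r1._1) (r2._1, r1._2)
    else null

  // Body of the function passed to reduce, lifted out so it can be called directly.
  def combine(x: List[(Int, Int)], y: List[(Int, Int)]): List[(Int, Int)] =
    x.flatMap { r1 =>
      y.flatMap { r2 =>
        merge(r1, r2) match {
          case null => List(r1, r2) // not adjacent: both ranges are emitted again
          case m    => List(m)      // adjacent: only the merged range is emitted
        }
      }
    }

  def main(args: Array[String]): Unit = {
    // Assumed partial results: (2,2) on one side, (1,1) and (4,5) on the other.
    val x = List((2, 2))
    val y = List((1, 1), (4, 5))
    println(combine(x, y)) // List((1,2), (2,2), (4,5))
  }
}
```

Here (2,2) merges with (1,1) into (1,2), but it is also paired with (4,5), and that non-adjacent pair emits (2,2) a second time. Every pair in the cross product is handled in isolation, so a range that has already been merged with one partner can still be re-emitted by another.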

Edit: my solution

I tried the following code. It seems to work on small inputs but takes too long on my original data. Is there a better solution than this?

// Join two ranges that share an endpoint; null means they share none.
def overlap(x: (Int, Int), y: (Int, Int)) = {
  if (x._2 == y._1) (x._1, y._2)
  else if (x._1 == y._2) (y._1, x._2)
  else null
}

// True when the two ranges share at least one endpoint.
def isOverlapping(x: (Int, Int), y: (Int, Int)) = {
  x._1 == y._1 || x._1 == y._2 || x._2 == y._1 || x._2 == y._2
}

val res = list.reduce((x, y) => {
  // Same pairwise merge as before, producing the accumulated list z.
  val z = x.map(r1 => {
    y.map(r2 => {
      val m = merge(r1, r2)
      m match {
        case null => List(r1, r2)
        case _    => List(m)
      }
    }).flatten
  }).flatten

  // Compress the accumulated list z by merging overlapping tuples.
  z.foldLeft(List[(Int, Int)]()) { (acc, i) =>
    if (!acc.exists(isOverlapping(i, _)))
      i +: acc
    else
      acc.map(x => {
        val m = overlap(x, i)
        m match {
          case null => x
          case _    => m
        }
      })
  }
})


res: List[(Int, Int)] = List((4,5), (1,2))
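
Because the input is a duplicate-free list of integers, steps 1 to 3 can also be sketched without any pairwise reduce: sort the numbers once, then extend the current range while the next number is consecutive. This is a minimal single-machine sketch in plain Scala, not a Spark job; `NumbersToRanges` and `toRanges` are hypothetical names, and the no-duplicates assumption comes from the note in the question.

```scala
object NumbersToRanges {
  // Sort once, then extend the most recent range while numbers stay consecutive.
  // Assumes no duplicates, as stated in the question.
  def toRanges(numbers: Seq[Int]): List[(Int, Int)] = {
    val newestFirst = numbers.sorted.foldLeft(List.empty[(Int, Int)]) {
      // n directly follows the end of the latest range: grow that range.
      case ((start, end) :: rest, n) if n == end + 1 => (start, n) :: rest
      // Otherwise (including the empty accumulator) open a new range (n, n).
      case (acc, n) => (n, n) :: acc
    }
    newestFirst.reverse // ranges were prepended, so restore ascending order
  }

  def main(args: Array[String]): Unit = {
    println(toRanges(List(5, 1, 4, 2))) // List((1,2), (4,5))
  }
}
```

The cost is one sort plus one linear pass, instead of comparing every pair of partial results.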

I solved this problem recently. I used List[List[Int]] as my collection.

My approach is to use a sorted collection, so that sorting does most of the work when we reduce the overlapping intervals. Intervals are ordered by start position, with ties broken by end position, which lets the whole problem be solved in O(n log n). I specifically used a SortedSet so that duplicate intervals are removed before the reduction.

Once the collection is sorted, we only have to check whether adjacent pairs overlap. I do that by checking 1stPair.end >= 2ndPair.start. If this is true, the pairs overlap and can be replaced by the single pair (1stPair.start, max(1stPair.end, 2ndPair.end)). There is no need to compare the start positions, because the collection is ordered and 2ndPair.start is always >= 1stPair.start. This is the saving we get from using a sorted collection.

I treat pairs that touch without strictly overlapping as overlapping and merge them; for example, [1,2] and [2,3] are reduced to [1,3]. The time complexity of the whole solution is dominated by sorting. SortedSet sorts on construction, which I assume costs O(n log n). Reducing the intervals is a single pass over the collection, so it is linear; since O(n) is dominated by O(n log n), the overall complexity is O(n log n). I ran this in a Scala worksheet and checked it against a couple of other inputs, and it works fine.

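import scala.collection.SortedSet

// Order intervals by start position, then by end position.
// Integer.compare is used because plain subtraction can overflow for large values.
object order extends Ordering[List[Int]] {
  override def compare(a: List[Int], b: List[Int]): Int = {
    if (a(0) != b(0)) Integer.compare(a(0), b(0))
    else Integer.compare(a(1), b(1))
  }
}

val sorted_input = SortedSet(List(6, 9), List(1, 4), List(3, 5), List(8, 12))(order)

// Fold step: `list` holds the merged intervals so far, most recent first.
def deduce(list: List[List[Int]], pair: List[Int]): List[List[Int]] = {
  (pair, list) match {
    // pair starts at or before the end of the latest interval: extend that interval.
    case (pair, head :: tail) if (pair(0) <= head(1)) =>
      List(head(0), if (pair(1) > head(1)) pair(1) else head(1)) :: tail
    // Otherwise (including the empty list) start a new interval.
    case (pair, rest) => pair :: rest
  }
}

sorted_input.foldLeft(List[List[Int]]())(deduce) //> res0: List[List[Int]] = List(List(6, 12), List(1, 5))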
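
The same sort-and-fold idea can be adapted to the question's (Int, Int) tuples. A minimal sketch follows, assuming a hypothetical `gap` parameter that is not part of the code above: with `gap = 0` it applies the rule used here (merge when the next start is <= the current end), and with `gap = 1` it also merges consecutive integers such as (1,1) and (2,2), which is what the question asks for. `IntervalMerge` and `mergeIntervals` are illustrative names.

```scala
object IntervalMerge {
  // gap = 0: merge when next.start <= current.end.
  // gap = 1: additionally merge consecutive integers, e.g. (1,1) and (2,2).
  def mergeIntervals(intervals: Seq[(Int, Int)], gap: Int = 0): List[(Int, Int)] = {
    // Tuples sort lexicographically: by start, then by end.
    val newestFirst = intervals.sorted.foldLeft(List.empty[(Int, Int)]) {
      // Compare as Long so that end + gap cannot overflow.
      case ((start, end) :: rest, (s, e)) if s.toLong <= end.toLong + gap =>
        (start, math.max(end, e)) :: rest
      // No overlap with the latest interval (or empty accumulator): start a new one.
      case (acc, interval) => interval :: acc
    }
    newestFirst.reverse // return the result in ascending order
  }

  def main(args: Array[String]): Unit = {
    println(mergeIntervals(List((6, 9), (1, 4), (3, 5), (8, 12))))            // List((1,5), (6,12))
    println(mergeIntervals(List((1, 1), (2, 2), (4, 4), (5, 5)), gap = 1))    // List((1,2), (4,5))

    // Merging chunk by chunk and then merging the partial results gives the same ranges.
    val chunks = List(List((1, 1)), List((2, 2), (4, 4)), List((5, 5)))
    val partial = chunks.flatMap(c => mergeIntervals(c, gap = 1))
    println(mergeIntervals(partial, gap = 1))                                 // List((1,2), (4,5))
  }
}
```

Since the function accepts intervals that are already merged, it can be reapplied to the concatenation of per-chunk results. That suggests the work could be split across partitions and combined at the end, though this sketch was not run on Spark.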
