
Reading a large file in functional Scala

I'm attempting to process a large binary file with Scala, and if possible I'd like to use a functional approach. My main method for this currently looks like this:

def getFromBis( buffer:List[Byte], bis:BufferedInputStream ):(Byte,List[Byte],Boolean) = {
    buffer match {
        case Nil =>
            val buffer2 = new Array[Byte](100000)
            bis.read(buffer2) match {
                case -1 => (-1,Nil,false)
                case n  =>
                    // keep only the n bytes actually read, not the whole array
                    val buffer3 = buffer2.take(n).toList
                    (buffer3.head,buffer3.tail,true)
            }
        case b::tail => return (b,tail,true)
    }
}

It takes a list buffer and a buffered input stream. If the buffer isn't empty, it simply returns the head and tail. If it is empty, it reads the next chunk from the file and uses that as the buffer instead.

As you can see, this isn't very functional. I'm trying to make as few underlying IO calls as possible, which is why I read the file in chunks. The problem here is the new Array. Every time the buffer runs empty, the function creates a new array, and judging by the constantly increasing memory usage as the program runs, I don't think these arrays are being garbage collected.

My question is this: is there a better way to read a large file in chunks using Scala? I'd like to keep a completely functional approach, but at the very least I need a function that can act as a black box for the rest of my functional program.

You almost certainly don't want to store bytes in a List. Every byte needs its own list cell object, which is really inefficient and will probably use around 20x more memory than you need.

The easiest way to do this is to create an iterator that stores internal state:

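import java.io.{BufferedInputStream, IOException}
import scala.language.implicitConversions

class BisReader(bis: BufferedInputStream) {
  val buffer = new Array[Byte](100000)  // allocated once and reused for every chunk
  var n = 0  // number of bytes in the current chunk; -1 once the stream is exhausted
  var i = 0  // index of the next unread byte in buffer
  // True if the current chunk has unread bytes; otherwise try to read another chunk.
  def hasNext: Boolean = (i < n) || (n >= 0 && {
    n = bis.read(buffer)
    i = 0
    hasNext
  })
  def next(): Byte = {
    if (i < n) {
      val b = buffer(i)
      i += 1
      b
    }
    else if (hasNext) next()
    else throw new IOException("Input stream empty")
  }
}
// Lets a BisReader be used wherever an Iterator[Byte] is expected.
implicit def reader_as_iterator(br: BisReader): Iterator[Byte] = new Iterator[Byte] {
  def hasNext: Boolean = br.hasNext
  def next(): Byte = br.next()
}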

One could have BisReader extend Iterator[Byte], but since Iterator isn't specialized, this would require boxing for raw next/hasNext access. This way, you can get low-level (next/hasNext) access at full speed when you need it, and use handy iterator methods otherwise.
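As a rough illustration of those two access styles, here is a minimal sketch. The reader is repeated so the block is self-contained. The names `BisReaderDemo`, `asIterator`, `sample`, `sumLowLevel`, and `sumFunctional` are hypothetical and not part of the answer, and `asIterator` is just an explicit stand-in for the implicit conversion. An in-memory stream stands in for a large file.

```scala
import java.io.{BufferedInputStream, ByteArrayInputStream, IOException}

// Same reader as in the answer, repeated so the sketch is self-contained.
class BisReader(bis: BufferedInputStream) {
  val buffer = new Array[Byte](100000)
  var n = 0
  var i = 0
  def hasNext: Boolean = (i < n) || (n >= 0 && {
    n = bis.read(buffer)
    i = 0
    hasNext
  })
  def next(): Byte =
    if (i < n) { val b = buffer(i); i += 1; b }
    else if (hasNext) next()
    else throw new IOException("Input stream empty")
}

object BisReaderDemo {
  // Explicit wrapper (hypothetical name), equivalent to the implicit conversion.
  def asIterator(br: BisReader): Iterator[Byte] = new Iterator[Byte] {
    def hasNext: Boolean = br.hasNext
    def next(): Byte = br.next()
  }

  // Hypothetical helper: an in-memory stream standing in for a large file.
  def sample(bytes: Array[Byte]): BisReader =
    new BisReader(new BufferedInputStream(new ByteArrayInputStream(bytes)))

  // Low-level loop: calls hasNext/next() on BisReader directly.
  def sumLowLevel(br: BisReader): Long = {
    var total = 0L
    while (br.hasNext) total += br.next()
    total
  }

  // Functional style: fold over the Iterator[Byte] view of the reader.
  def sumFunctional(br: BisReader): Long =
    asIterator(br).foldLeft(0L)((acc, b) => acc + b)
}
```

Both functions visit the same bytes. The first stays on primitive `Byte` values, while the second goes through the generic `Iterator` interface in exchange for methods such as `foldLeft`, `map`, and `grouped`.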

Now you've isolated your ugly nonfunctional Java IO stuff in a single class with a clean interface, and can go back to being functional.


Edit: except, of course, IO is order-dependent and has side effects, but the approach in the question doesn't get around that either.
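To make that caveat concrete, here is a small sketch showing that an iterator over a stream can only be traversed once. It does not use BisReader. To keep the block short it swaps in the standard-library idiom `Iterator.continually(in.read())`, and the names `SinglePassDemo`, `bytesOf`, and `countTwice` are hypothetical.

```scala
import java.io.{BufferedInputStream, ByteArrayInputStream, InputStream}

object SinglePassDemo {
  // Standard-library idiom, used here instead of BisReader for brevity:
  // read() returns the next byte as an Int in 0..255, or -1 at end of stream.
  def bytesOf(in: InputStream): Iterator[Byte] =
    Iterator.continually(in.read()).takeWhile(_ != -1).map(_.toByte)

  // Returns the number of bytes seen on a first and on a second traversal
  // of the same underlying stream.
  def countTwice(data: Array[Byte]): (Int, Int) = {
    val in = new BufferedInputStream(new ByteArrayInputStream(data))
    val first = bytesOf(in).size   // consumes the stream
    val second = bytesOf(in).size  // nothing is left to read
    (first, second)
  }
}
```

The second traversal sees no bytes because the first one has already advanced the stream, so any code built on top of such an iterator has to treat it as single-use.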
