
Reading large number of bytes from GZIPInputStream

I am reading a gzipped file through GZIPInputStream. I want to read a large amount of data at once, but no matter how many bytes I ask the GZIPInputStream to read, it always reads far fewer bytes. For example,

val bArray = new Array[Byte](81920)
val fis = new FileInputStream(new File(inputFileName))
val gis = new GZIPInputStream(fis)
val bytesRead =  gis.read(bArray)

The number of bytes read is always around 1800, while it should be nearly equal to the size of bArray, which is 81920 in this case. Why is it like this? Is there a way to solve this problem and actually read more bytes at once?

I would try using akka-streams if you have a large amount of data.

import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import java.io.{File, FileInputStream}
import java.util.zip.GZIPInputStream
import scala.io.{BufferedSource, Source}

implicit val system = ActorSystem()
implicit val ec = system.dispatcher
implicit val materializer = ActorMaterializer()

val fis = new FileInputStream(new File(""))
val gis = new GZIPInputStream(fis)
val bfs: BufferedSource = Source.fromInputStream(gis)

bfs exposes the Flow API for stream processing.
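If you want a genuine akka-streams pipeline over the decompressed bytes, a minimal sketch might look like this (assuming StreamConverters from akka.stream.scaladsl, akka's adapter for blocking InputStreams; the /tmp/input.gz path is just a placeholder):

import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.StreamConverters
import java.io.{File, FileInputStream}
import java.util.zip.GZIPInputStream

object GzipStreamDemo extends App {
  implicit val system = ActorSystem()
  implicit val ec = system.dispatcher
  implicit val materializer = ActorMaterializer()

  // Wrap the decompressing stream in a Source of ByteString chunks.
  val source = StreamConverters.fromInputStream(
    () => new GZIPInputStream(new FileInputStream(new File("/tmp/input.gz"))))

  // Example: total up the decompressed bytes, then shut down.
  source.map(_.length).runFold(0L)(_ + _).andThen {
    case result =>
      println(result)
      system.terminate()
  }
}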

You can also get a Java Stream of lines from bfs:

val ss: java.util.stream.Stream[String] = bfs.bufferedReader().lines()

A read may always return fewer bytes than you asked for, so in general you have to loop, reading until you have as many as you want.

In other words, giving GZIPInputStream a big buffer doesn't mean it will be filled on a given request.

import java.io.{File, FileInputStream, FilterInputStream, InputStream}
import java.util.zip.GZIPInputStream

object Unzipped extends App {
  val inputFileName = "/tmp/sss.gz"
  val bArray = new Array[Byte](80 * 1024)
  val fis = new FileInputStream(new File(inputFileName))
  val stingy = new StingyInputStream(fis)
  val gis = new GZIPInputStream(stingy, 80 * 1024)
  val bytesRead = gis.read(bArray, 0, bArray.length)
  println(bytesRead) // far less than 80K, despite the big buffers
}

// An InputStream that never returns more than 1K per read, simulating an
// underlying stream (file, socket) that delivers data in small chunks.
class StingyInputStream(is: InputStream) extends FilterInputStream(is) {
  override def read(b: Array[Byte], off: Int, len: Int) = {
    val n = len.min(1024)
    super.read(b, off, n)
  }
}

So instead of issuing one read, loop to drain the stream:

  import scala.reflect.io.Streamable.Bytes

  // Streamable.Bytes does the draining loop internally.
  val sb = new Bytes {
    override val length = 80 * 1024L
    override val inputStream = gis
  }
  val res = sb.toByteArray() // reads until `length` bytes or EOF
  println(res.length)        // your explicit length

I'm not saying that's the API to use, it's just a demo. I'm too lazy to write a loop.
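The loop alluded to above would look something like this (a minimal sketch; readFully is just an illustrative helper name, not a library method):

// Keep reading until the buffer is full or the stream is exhausted.
def readFully(in: java.io.InputStream, buf: Array[Byte]): Int = {
  var off = 0
  var n = 0
  while (off < buf.length && n != -1) {
    n = in.read(buf, off, buf.length - off)
    if (n != -1) off += n
  }
  off // total number of bytes actually read
}

val total = readFully(gis, bArray)
println(total) // close to 81920 unless the file ends sooner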

OK, I found the solution. There is a version of the GZIPInputStream constructor that also takes the size of the buffer.
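For example (same setup as in the question; the second argument sets the size of GZIPInputStream's internal buffer, though as noted above a single read may still return fewer bytes than requested):

val fis = new FileInputStream(new File(inputFileName))
// The second argument is the internal buffer size used during decompression.
val gis = new GZIPInputStream(fis, 81920)
val bytesRead = gis.read(bArray) // may still be less than bArray.length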
