Compressing an array of integers in Java

I have some extremely large arrays of integers that I would like to compress.
However, the way to do it in Java is to use something like this:

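import java.io.ByteArrayOutputStream;
import java.io.ObjectOutputStream;
import java.util.zip.DeflaterOutputStream;
int[] myIntArray; // the large array to compress, populated elsewhere
ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream(1024);
ObjectOutputStream objectOutputStream = new ObjectOutputStream(new DeflaterOutputStream(byteArrayOutputStream));
objectOutputStream.writeObject(myIntArray);
objectOutputStream.close(); // closing finishes the deflater so all compressed bytes are written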

Note that Java first needs to convert the int array to bytes. I know that this is fast, but it still has to create a whole new byte array, scan through the entire original int array, convert each value to bytes, and copy the result into the new byte array.

Is there any way to skip the byte conversion and compress the integers directly?

Skip the ObjectOutputStream and just store the ints directly as four bytes each. DataOutputStream.writeInt, for instance, is an easy way to do it.
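A minimal sketch of that suggestion, assuming the compressed result should stay in memory as in the question. The class name IntArrayDeflate, the method names, and the choice to write the array length first (so the reader knows how many ints to expect) are illustrative assumptions, not part of the original answer.

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

public class IntArrayDeflate {
    // Writes the length, then each int as four big-endian bytes, through a deflater.
    static byte[] compress(int[] values) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new DeflaterOutputStream(bytes)))) {
            out.writeInt(values.length);
            for (int v : values) {
                out.writeInt(v);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The try block has closed the streams, which finishes the deflater.
        return bytes.toByteArray();
    }

    // Reverses compress(): reads the length, then that many ints.
    static int[] decompress(byte[] compressed) {
        try (DataInputStream in = new DataInputStream(
                new InflaterInputStream(new ByteArrayInputStream(compressed)))) {
            int[] values = new int[in.readInt()];
            for (int i = 0; i < values.length; i++) {
                values[i] = in.readInt();
            }
            return values;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The ints are still turned into bytes on their way into the deflater, but they pass through a small buffer rather than a second full-size byte array.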

Hmm. A general-purpose compression algorithm won't necessarily do a good job of compressing an array of binary values unless there's a lot of redundancy. You might do better to develop something of your own, based on what you know about the data.

What is it that you're actually trying to compress?

You could use the representation used by Protocol Buffers. Each integer is represented by 1-5 bytes, depending on its magnitude.

Additionally, the new "packed" representation means you basically get a small "header" that says how big the array is (and which field it's in), followed by just the data. That's probably what ObjectOutputStream does as well, but it's a recent innovation in PB :)

Note that this compresses based on magnitude, not on how often each integer has been seen. That will dramatically affect whether it's useful for you or not.
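To make the magnitude-based idea concrete, here is a rough sketch of a base-128 variable-length encoding in the style that Protocol Buffers uses. It is hand-rolled for illustration and does not use the Protocol Buffers library; the class VarIntCodec and its methods are hypothetical names. It assumes each int is treated as an unsigned 32-bit value, so negative numbers always take the full 5 bytes, and it assumes the decoder is told how many values to expect.

```java
import java.io.ByteArrayOutputStream;

public class VarIntCodec {
    // Encodes each int as 1-5 bytes: 7 payload bits per byte,
    // with the high bit set when more bytes follow.
    static byte[] encode(int[] values) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int v : values) {
            while ((v & ~0x7F) != 0) {
                out.write((v & 0x7F) | 0x80);
                v >>>= 7; // unsigned shift, so negative values terminate
            }
            out.write(v);
        }
        return out.toByteArray();
    }

    // Decodes exactly `count` ints from the encoded bytes.
    static int[] decode(byte[] data, int count) {
        int[] values = new int[count];
        int pos = 0;
        for (int i = 0; i < count; i++) {
            int result = 0;
            int shift = 0;
            byte b;
            do {
                b = data[pos++];
                result |= (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            values[i] = result;
        }
        return values;
    }
}
```

With this scheme, values below 128 take one byte each, so an array dominated by small non-negative numbers shrinks to roughly a quarter of its raw size, while an array of large or negative values grows.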

A byte array is not going to save you much memory unless you make it a byte array holding unsigned ints, which is very dangerous in Java. It trades memory overhead for extra processing time spent on the step checking of the code. This may be all right for data storage, but there are already data storage solutions out there.
Unless you are doing this for serialization purposes, I think that you are wasting your time.

If the array of ints is guaranteed to have no duplicates, you can use a java.util.BitSet instead.

Its underlying implementation is an array of bits, with each bit indicating whether a certain integer is present in the BitSet, so its memory usage is quite low and it needs less space when serialized.
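A small sketch of the BitSet approach, with some assumptions spelled out: the values must be distinct and non-negative (BitSet rejects negative indices), the original order is not preserved, and the size of the BitSet grows with the largest value stored rather than with the number of values. The class and method names are illustrative only.

```java
import java.util.BitSet;

public class IntSetAsBits {
    // Assumes the values are distinct and non-negative.
    static BitSet toBitSet(int[] values) {
        BitSet bits = new BitSet();
        for (int v : values) {
            bits.set(v);
        }
        return bits;
    }

    // Recovers the values in ascending order; the original order is lost.
    static int[] toSortedArray(BitSet bits) {
        return bits.stream().toArray();
    }

    // Compact byte form, roughly one bit per value up to the largest one set.
    static byte[] toBytes(BitSet bits) {
        return bits.toByteArray();
    }
}
```

This pays off when the values are dense within their range. A handful of values scattered up to Integer.MAX_VALUE would need far more space as bits than as plain ints.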

In your example, you are writing the compressed stream to the ByteArrayOutputStream. Your compressed array needs to exist somewhere, and if the destination is memory, then ByteArrayOutputStream is your likely choice. You could also write the stream to a socket or file. In that case, you wouldn't duplicate the stream in memory. If your array is 800MB and you're running in 1GB of memory, you could easily write the array to a compressed file with the example you included. The change would be replacing the ByteArrayOutputStream with a file stream.
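As a sketch of that change, assuming the destination is a local file, the question's code could look roughly like this with a FileOutputStream in place of the ByteArrayOutputStream. The class name CompressToFile and the read method are additions for illustration.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

public class CompressToFile {
    // Same pipeline as in the question, but the compressed bytes go straight to disk.
    static void write(int[] values, File target) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(
                new DeflaterOutputStream(new FileOutputStream(target)))) {
            out.writeObject(values);
        }
    }

    // Reads the array back by inflating the file contents.
    static int[] read(File source) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                new InflaterInputStream(new FileInputStream(source)))) {
            return (int[]) in.readObject();
        }
    }
}
```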

The ObjectOutputStream format is actually fairly efficient. It will not duplicate your array in memory, and it has special code for writing arrays efficiently.

Do you want to work with the compressed array in memory? Would your data lend itself well to a sparse array? Sparse arrays are good when you have large gaps in your data.
