
Java Deflater for large set of random strings

I am using the Deflater class to try to compress a large set of random strings. My compression and decompression methods look like this:

public static String compressAndEncodeBase64(String text) {
    try {
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        // Closing the DeflaterOutputStream finishes the deflate stream and flushes it into os.
        try (DeflaterOutputStream dos = new DeflaterOutputStream(os)) {
            // Use an explicit charset so the result matches decompressB64, which decodes as UTF-8.
            dos.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return Base64.getEncoder().encodeToString(os.toByteArray());
    } catch (Exception e) {
        log.info("Caught exception when trying to compress {}: ", text, e);
    }
    return null;
}

public static String decompressB64(String compressedAndEncodedText) {
    try {
        byte[] decodedText = Base64.getDecoder().decode(compressedAndEncodedText);

        ByteArrayOutputStream os = new ByteArrayOutputStream();
        try (OutputStream ios = new InflaterOutputStream(os)) {
            ios.write(decodedText);
        }
        byte[] decompressedBArray = os.toByteArray();
        return new String(decompressedBArray, StandardCharsets.UTF_8);
    } catch (Exception e){
        log.error("Caught following exception when trying to decode and decompress text {}: ", compressedAndEncodedText, e);
        throw new BadRequestException(Constants.ErrorMessages.COMPRESSED_GROUPS_HEADER_ERROR);
    }
}

However, when I test this on a large set of random strings, my "compressed" string is larger than the original string. Even for a relatively small random string, the compressed data is longer. For example, this unit test fails:

@Test
public void testCompressDecompressRandomString() {
    String orig = RandomStringUtils.random(71, true, true);
    String compressedString = compressAndEncodeBase64(orig);
    Assertions.assertTrue(orig.length() > compressedString.length(), "The original string has length " + orig.length() + ", while compressed string has length " + compressedString.length());
}

Can anyone explain what's going on and suggest a possible alternative?

Note: I tried using the Deflater without the Base64 encoding:

public static String compress(String data) {
    byte[] input = data.getBytes(StandardCharsets.UTF_8);
    Deflater deflater = new Deflater();
    deflater.setInput(input);
    deflater.finish();
    ByteArrayOutputStream os = new ByteArrayOutputStream();
    byte[] buffer = new byte[1024];
    // Drain the deflater in a loop so output larger than the buffer is not truncated.
    while (!deflater.finished()) {
        int n = deflater.deflate(buffer);
        os.write(buffer, 0, n);
    }
    deflater.end(); // release the native zlib memory
    byte[] compressed = os.toByteArray();
    log.info("The Original String: {}\n Size: {}", data, input.length);
    log.info("The Compressed Output Size: {}", compressed.length);
    // Compressed bytes are not valid UTF-8; ISO_8859_1 maps each byte to one char without loss.
    return new String(compressed, StandardCharsets.ISO_8859_1);
}

My test still fails, however.

First off, you aren't going to get much, if any, compression on short strings. Compressors need more data, both to collect statistics on the data and to have previous data in which to look for repeated strings.
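To make the effect of input length concrete, here is a minimal sketch that compresses many short records one at a time and then as a single joined input. The record format (`group-0001:read,write`) and the class and method names are made up for illustration; they are not from the question.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.Deflater;

class ShortVersusCombined {
    /** Size in bytes of the zlib-wrapped deflate output for the given input. */
    static int deflatedSize(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        byte[] buffer = new byte[8192];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buffer);
        }
        deflater.end(); // release native zlib memory
        return total;
    }

    /** Hypothetical short records that share a lot of structure. */
    static List<String> sampleRecords(int count) {
        List<String> records = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            records.add(String.format("group-%04d:read,write", i));
        }
        return records;
    }

    /** Compresses each record on its own and adds up the output sizes. */
    static int sumOfIndividualSizes(List<String> records) {
        int total = 0;
        for (String r : records) {
            total += deflatedSize(r.getBytes(StandardCharsets.UTF_8));
        }
        return total;
    }

    /** Joins all records with newlines and compresses them as one input. */
    static int combinedSize(List<String> records) {
        return deflatedSize(String.join("\n", records).getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        List<String> records = sampleRecords(1000);
        System.out.println("sum of individual outputs: " + sumOfIndividualSizes(records));
        System.out.println("single combined output:    " + combinedSize(records));
    }
}
```

Each individually compressed record pays the fixed stream overhead and has no earlier data to match against, so the combined input should come out far smaller than the sum of the individual outputs.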

Second, if you're testing with random data, you are further crippling the compressor, since now there are no repeated strings. For your test case with random alphanumeric strings, the only compression you can get comes from the fact that there are only 62 possible values for each byte. That can be compressed by a factor of log(62)/log(256) = 0.744. Even then, you need enough input to offset the overhead of the code description. Your test case of 71 characters will always be compressed to 73 bytes by deflate, which is essentially just copying the data with a small overhead. There isn't enough input to justify the code description needed to take advantage of the limited character set. If I have 1,000,000 random characters from that set of 62, then deflate can compress that to about 752,000 bytes.
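The figures above can be checked with a rough sketch like the following. It assumes raw deflate (the `nowrap` flag of `java.util.zip.Deflater`, so no zlib header or trailer is counted) and uses `java.util.Random` with a fixed seed in place of `RandomStringUtils`; the class and method names are illustrative. Exact sizes may vary slightly with the zlib build behind the JDK.

```java
import java.nio.charset.StandardCharsets;
import java.util.Random;
import java.util.zip.Deflater;

class RandomAlphanumericDeflate {
    private static final String ALPHABET =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";

    /** Builds a pseudo-random string over the 62-character alphabet. */
    static String randomAlphanumeric(int length, long seed) {
        Random rng = new Random(seed);
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append(ALPHABET.charAt(rng.nextInt(ALPHABET.length())));
        }
        return sb.toString();
    }

    /** Number of bytes produced by raw deflate (no zlib header or trailer). */
    static int rawDeflatedSize(byte[] input) {
        Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION, true); // true = raw deflate
        deflater.setInput(input);
        deflater.finish();
        byte[] buffer = new byte[8192];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buffer);
        }
        deflater.end();
        return total;
    }

    /** Best achievable output/input ratio for uniformly random bytes drawn from an alphabet of this size. */
    static double entropyBound(int alphabetSize) {
        return Math.log(alphabetSize) / Math.log(256);
    }

    public static void main(String[] args) {
        System.out.printf("entropy bound: %.3f%n", entropyBound(62));
        for (int n : new int[] {71, 1_000_000}) {
            byte[] input = randomAlphanumeric(n, 42L).getBytes(StandardCharsets.US_ASCII);
            int out = rawDeflatedSize(input);
            System.out.printf("%d bytes -> %d bytes (ratio %.3f)%n", n, out, (double) out / n);
        }
    }
}
```

Note that the default `new Deflater()` and `DeflaterOutputStream` used in the question also add a zlib header and checksum, so their output is a few bytes larger than the raw deflate sizes measured here.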

Third, you are then expanding the resulting compressed data by a factor of 1.333 by encoding it using Base64. So if I take that compression by a factor of 0.752 and then expand it by 1.333, I get an overall expansion of 1.002. You won't get anywhere that way on random characters from a set of 62, no matter how long the input is.
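The arithmetic can be spelled out in a few lines. This sketch takes the approximate 752,000-byte figure from the paragraph above as given and computes the padded Base64 length with the usual four-characters-per-three-bytes rule; the class and method names are illustrative.

```java
import java.util.Base64;

class Base64Overhead {
    /** Length of padded Base64 text for n input bytes: 4 characters per 3-byte group. */
    static int base64Length(int n) {
        return 4 * ((n + 2) / 3);
    }

    public static void main(String[] args) {
        int original = 1_000_000; // random alphanumeric characters
        int deflated = 752_000;   // approximate deflate output quoted above
        int encoded = base64Length(deflated);

        // Cross-check the formula against the JDK encoder on a buffer of the same size.
        int actual = Base64.getEncoder().encodeToString(new byte[deflated]).length();

        System.out.printf("deflate ratio: %.3f%n", (double) deflated / original);
        System.out.printf("Base64 ratio:  %.3f%n", (double) encoded / deflated);
        System.out.printf("overall ratio: %.3f%n", (double) encoded / original);
        System.out.println("formula matches encoder: " + (encoded == actual));
    }
}
```

Because 62 symbols carry slightly less than 6 bits each and Base64 carries exactly 6 bits per character, the round trip through deflate and Base64 cannot end up meaningfully shorter than the original alphanumeric text.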

Given all that, you need to do your testing on real-world inputs. I suspect that your application does not have randomly generated data. Don't attempt compression on short strings. Combine your strings into much longer input, so that the compressor has something to work with. If you must encode with Base64, then you must. But expect that there may be expansion instead of compression. You could include in your output format an option for chunks to be compressed or not compressed, indicated by a leading byte. Then when compressing, if the data doesn't get smaller, send it without compression instead. You can also try a more efficient encoding, e.g. Base85, or whatever number of characters you can transmit transparently.
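One possible shape for the leading-byte idea is sketched below. The flag values `STORED` and `DEFLATED`, the class name, and the method names are assumptions made for this example, not part of any standard format; a Base64 or Base85 step would be applied to the packed bytes afterwards if a text transport is required.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

class FlaggedChunk {
    // Hypothetical flag values chosen for this sketch.
    static final byte STORED = 0;
    static final byte DEFLATED = 1;

    /** Returns flag byte + deflated data, or flag byte + raw data if deflate does not help. */
    static byte[] pack(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(DEFLATED);
        byte[] buffer = new byte[8192];
        while (!deflater.finished()) {
            out.write(buffer, 0, deflater.deflate(buffer));
        }
        deflater.end();
        if (out.size() - 1 < input.length) {
            return out.toByteArray();
        }
        // Compression did not shrink the data: send it uncompressed instead.
        byte[] stored = new byte[input.length + 1];
        stored[0] = STORED;
        System.arraycopy(input, 0, stored, 1, input.length);
        return stored;
    }

    /** Reverses pack(); throws IllegalArgumentException on malformed input. */
    static byte[] unpack(byte[] packed) {
        if (packed.length == 0) {
            throw new IllegalArgumentException("empty chunk");
        }
        if (packed[0] == STORED) {
            return Arrays.copyOfRange(packed, 1, packed.length);
        }
        if (packed[0] != DEFLATED) {
            throw new IllegalArgumentException("unknown flag byte: " + packed[0]);
        }
        Inflater inflater = new Inflater();
        inflater.setInput(packed, 1, packed.length - 1);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated or unsupported deflate stream");
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("corrupt deflate stream", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }
}
```

With this layout the worst case is one extra byte per chunk, so short or incompressible inputs never grow by more than that before the text encoding is applied.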
