
How to compress a String in Java?

I use GZIPOutputStream or ZipOutputStream to compress a String (my string.length() is less than 20), but the compressed result is longer than the original string.

On some sites, I found people saying that this is because my original string is too short, and that GZIPOutputStream is meant for compressing longer strings.

So, can somebody help me compress a String?

My function looks like this:

String compress(String original) throws Exception {

}

Update:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

// ZipUtil: gzip-compresses a String and returns the raw bytes as an ISO-8859-1 String
public class ZipUtil {
    public static String compress(String str) throws IOException {
        if (str == null || str.length() == 0) {
            return str;
        }

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // try-with-resources closes the stream, which also writes the gzip trailer
        try (GZIPOutputStream gzip = new GZIPOutputStream(out)) {
            gzip.write(str.getBytes(StandardCharsets.UTF_8));
        }
        // ISO-8859-1 maps every byte to exactly one char, so no byte is lost
        return out.toString("ISO-8859-1");
    }

    public static void main(String[] args) throws IOException {
        String string = "admin";
        System.out.println("after compress:");
        System.out.println(ZipUtil.compress(string));
    }
}

The result is: [screenshot of the program output; image not available]

Compression algorithms almost always have some form of space overhead, which means that they are only effective when the data is large enough that the overhead is smaller than the amount of space saved.
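To make the overhead visible, here is a minimal sketch that compares input and output sizes. The gzip format alone adds a 10-byte header and an 8-byte trailer before any data is stored. The class name GzipOverheadDemo and the sample strings are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class: measures how many bytes gzip produces for a given text.
class GzipOverheadDemo {
    // Returns the gzip-compressed form of the UTF-8 bytes of text.
    static byte[] gzip(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            // Writing to an in-memory stream should not fail; rethrow unchecked
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String shortText = "admin";
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 200; i++) {
            sb.append("admin ");
        }
        String longText = sb.toString();

        // The short input grows, the long repetitive input shrinks
        System.out.println(shortText.length() + " chars -> " + gzip(shortText).length + " bytes");
        System.out.println(longText.length() + " chars -> " + gzip(longText).length + " bytes");
    }
}
```

For the 5-character input the output cannot be smaller than the 18 bytes of framing, while the long repetitive input compresses well.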

Compressing a string that is only 20 characters long is not easy, and it is not always possible. If you have repetition, Huffman coding or simple run-length encoding might be able to compress it, but probably not by very much.
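As a rough illustration of run-length encoding, the sketch below writes each run as the character followed by its length. The class name RunLengthDemo is made up, and the format assumes the input contains no digits (otherwise decoding would be ambiguous).

```java
// Hypothetical demo class: naive run-length encoding of a String.
class RunLengthDemo {
    // Encodes each run of identical characters as the character followed by the run length.
    static String encode(String text) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            int run = 1;
            // Extend the run while the next character is the same
            while (i + run < text.length() && text.charAt(i + run) == c) {
                run++;
            }
            out.append(c).append(run);
            i += run;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("aaaaaaaabbbb")); // repetitive input shrinks
        System.out.println(encode("admin"));        // input without repetition doubles in size
    }
}
```

With repetition the result is shorter than the input, but a string like "admin" has no runs and ends up twice as long.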

When you create a String, you can think of it as a list of chars. This means that for each character in your String, you need to support all the possible values of char. From the Sun docs:

char: The char data type is a single 16-bit Unicode character. It has a minimum value of '\u0000' (or 0) and a maximum value of '\uffff' (or 65,535 inclusive).

If you have a reduced set of characters you want to support, you can write a simple compression algorithm, which is analogous to binary->decimal->hex radix conversion. You go from 65,536 (or however many characters your target system supports) to 26 (alphabetical) / 36 (alphanumeric) etc.

I've used this trick a few times, for example encoding timestamps as text (target 36+, source 10). Just make sure you have plenty of unit tests!
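A minimal sketch of the timestamp case, assuming the value fits in a long and is non-negative. It relies on the radix overloads of Long.toString and Long.parseLong; the class name Base36Demo and the sample timestamp are made up.

```java
// Hypothetical demo class: stores a decimal number as shorter base-36 text.
class Base36Demo {
    // Base 10 -> base 36 (digits 0-9 followed by letters a-z)
    static String encode(long value) {
        return Long.toString(value, 36);
    }

    // Base 36 -> base 10
    static long decode(String text) {
        return Long.parseLong(text, 36);
    }

    public static void main(String[] args) {
        long timestamp = 1700000000000L; // sample epoch milliseconds
        String encoded = encode(timestamp);
        System.out.println(timestamp + " -> " + encoded + " -> " + decode(encoded));
    }
}
```

The 13 decimal digits of the sample value become 8 base-36 characters, and the round trip returns the original number.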

If the passwords are more or less "random", you are out of luck: you will not be able to get a significant reduction in size.

But: why do you need to compress the passwords? Maybe what you need is not compression, but some sort of hash value? If you just need to check whether a name matches a given password, you don't need to save the password; you can save the hash of the password instead. To check whether a typed-in password matches a given name, you can build the hash value the same way and compare it to the saved hash. As a hash (Object.hashCode()) is an int, you will be able to store all 20 password hashes in 80 bytes.

Your friend is correct. Both gzip and ZIP are based on DEFLATE. This is a general-purpose algorithm, and is not intended for encoding small strings.

If you need this, a possible solution is a custom encoding and decoding HashMap<String, String>. This allows you to do a simple one-to-one mapping:

HashMap<String, String> toCompressed, toUncompressed;

String compressed = toCompressed.get(uncompressed);
// ...
String uncompressed = toUncompressed.get(compressed);

Clearly, this requires setup, and is only practical for a small number of strings.
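The setup step could look roughly like this. It is only a sketch: the Codebook class, its method names, and the sample entries are hypothetical, and both sides must register the same pairs.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: a fixed two-way mapping between known strings and short codes.
class Codebook {
    private final Map<String, String> toCompressed = new HashMap<>();
    private final Map<String, String> toUncompressed = new HashMap<>();

    // Registers one pair; duplicates are rejected so the mapping stays one-to-one.
    void register(String full, String code) {
        if (toCompressed.containsKey(full) || toUncompressed.containsKey(code)) {
            throw new IllegalArgumentException("duplicate mapping: " + full + " / " + code);
        }
        toCompressed.put(full, code);
        toUncompressed.put(code, full);
    }

    String compress(String full) {
        String code = toCompressed.get(full);
        if (code == null) {
            throw new IllegalArgumentException("unknown string: " + full);
        }
        return code;
    }

    String decompress(String code) {
        String full = toUncompressed.get(code);
        if (full == null) {
            throw new IllegalArgumentException("unknown code: " + code);
        }
        return full;
    }

    public static void main(String[] args) {
        Codebook book = new Codebook();
        book.register("administrator", "a");
        book.register("guest", "g");
        String code = book.compress("administrator");
        System.out.println(code + " -> " + book.decompress(code));
    }
}
```

Rejecting unknown strings keeps decoding unambiguous, at the price of only supporting the strings registered up front.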

Huffman coding might help, but only if your small string contains a lot of frequently repeated characters.

DEFLATE, the algorithm behind ZIP, is a combination of LZ77 dictionary coding (a close relative of LZW) and Huffman trees. You can use one of these techniques separately.

The compression is based on 2 factors:

  • the repetition of substrings in your original string (LZ77/LZW): if there are a lot of repetitions, the compression will be efficient. This family of algorithms performs well on long plain text, since words are often repeated
  • the frequency of each character in the string (Huffman): the more unbalanced the distribution between characters, the more efficient the compression

In your case, you should try the LZW algorithm only. In its basic form, the string can be compressed without adding meta-information, which is probably better for short strings.

For the Huffman algorithm, the coding tree has to be sent with the compressed text. So, for a small text, the result can be larger than the original text because of the tree.
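A textbook LZW sketch, to show that no table has to be transmitted: both sides start from the same 256 single-character entries and grow the dictionary identically. The class name LzwDemo is made up, the code assumes every character is below 256, and it returns a list of integer codes rather than packed bits.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical demo class: basic LZW over characters 0..255.
class LzwDemo {
    static List<Integer> compress(String text) {
        Map<String, Integer> dict = new HashMap<>();
        for (int i = 0; i < 256; i++) {
            dict.put(String.valueOf((char) i), i);
        }
        List<Integer> out = new ArrayList<>();
        String w = "";
        for (char c : text.toCharArray()) {
            String wc = w + c;
            if (dict.containsKey(wc)) {
                w = wc; // keep extending the current match
            } else {
                out.add(dict.get(w));      // emit the longest known prefix
                dict.put(wc, dict.size()); // learn the new sequence
                w = String.valueOf(c);
            }
        }
        if (!w.isEmpty()) {
            out.add(dict.get(w));
        }
        return out;
    }

    static String decompress(List<Integer> codes) {
        if (codes.isEmpty()) {
            return "";
        }
        Map<Integer, String> dict = new HashMap<>();
        for (int i = 0; i < 256; i++) {
            dict.put(i, String.valueOf((char) i));
        }
        String w = dict.get(codes.get(0));
        StringBuilder out = new StringBuilder(w);
        for (int i = 1; i < codes.size(); i++) {
            int k = codes.get(i);
            String entry;
            if (dict.containsKey(k)) {
                entry = dict.get(k);
            } else if (k == dict.size()) {
                entry = w + w.charAt(0); // code defined by the step being decoded
            } else {
                throw new IllegalArgumentException("bad code: " + k);
            }
            out.append(entry);
            dict.put(dict.size(), w + entry.charAt(0)); // rebuild the same dictionary
            w = entry;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Integer> codes = compress("abababab");
        System.out.println(codes.size() + " codes for 8 chars: " + decompress(codes));
        System.out.println(compress("admin").size() + " codes for 5 chars");
    }
}
```

A repetitive string needs fewer codes than it has characters, but a string like "admin" yields one code per character, and each code needs more than 8 bits once the dictionary grows past 256 entries.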

Huffman encoding is a sensible option here. Gzip and friends do this, but the way they work is to build a Huffman tree for the input, send that, then send the data encoded with the tree. If the tree is large relative to the data, there may be no saving in size.

However, it is possible to avoid sending a tree: instead, you arrange for the sender and receiver to already have one. It can't be built specifically for every string, but you can have a single global tree used to encode all strings. If you build it from the same language as the input strings (English or whatever), you should still get good compression, although not as good as with a custom tree for every input.
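java.util.zip has no API for a pre-shared Huffman tree, but it does offer a related mechanism built on the same idea of shared knowledge: a preset dictionary, set with Deflater.setDictionary and Inflater.setDictionary. The sketch below uses that instead of a shared tree. The class name, the dictionary contents, and the sample text are made up, and both sides must hold the identical dictionary bytes.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical demo class: DEFLATE with a dictionary known to both sender and receiver.
class SharedDictionaryDemo {
    // Assumed shared knowledge: substrings that are likely to occur in the inputs
    static final byte[] DICT =
            "administrator password username guest".getBytes(StandardCharsets.UTF_8);

    // Compresses text; pass null as dictionary to compress without one.
    static byte[] compress(String text, byte[] dictionary) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        if (dictionary != null) {
            deflater.setDictionary(dictionary);
        }
        deflater.setInput(text.getBytes(StandardCharsets.UTF_8));
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    static String decompress(byte[] data, byte[] dictionary) {
        Inflater inflater = new Inflater();
        inflater.setInput(data);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0) {
                    if (inflater.needsDictionary() && dictionary != null) {
                        // The stream header says a preset dictionary was used
                        inflater.setDictionary(dictionary);
                    } else if (inflater.needsInput() || inflater.needsDictionary()) {
                        throw new IllegalArgumentException("truncated input or missing dictionary");
                    }
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException(e);
        } finally {
            inflater.end();
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String text = "administrator password";
        System.out.println("without dictionary: " + compress(text, null).length + " bytes");
        System.out.println("with dictionary:    " + compress(text, DICT).length + " bytes");
        System.out.println(decompress(compress(text, DICT), DICT));
    }
}
```

The gain depends entirely on how well the dictionary matches the real inputs, and the zlib framing still costs a few bytes, so very short strings may not shrink even with a good dictionary.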

If you know that your strings are mostly ASCII, you could convert them to UTF-8.

byte[] bytes = string.getBytes("UTF-8");

This may reduce the memory size by about 50%. However, you will get a byte array out and not a String. If you are writing it to a file, though, that should not be a problem.

To convert back to a String:

private final Charset UTF8_CHARSET = Charset.forName("UTF-8");
...
String s = new String(bytes, UTF8_CHARSET);
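Putting both directions together, a small round-trip sketch might look like this. The class name Utf8Demo is made up, and the 50% figure only holds when the text is ASCII and is compared against 2 bytes per char.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: String <-> UTF-8 byte array round trip.
class Utf8Demo {
    static byte[] toUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    static String fromUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String s = "admin";
        byte[] bytes = toUtf8(s);
        // ASCII characters take 1 byte each in UTF-8, versus 2 bytes per 16-bit char
        System.out.println("UTF-8 bytes: " + bytes.length + ", 16-bit chars: " + (2 * s.length()) + " bytes");
        System.out.println(fromUtf8(bytes));
    }
}
```

Using StandardCharsets.UTF_8 avoids the checked UnsupportedEncodingException that the String-named charset overloads declare.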

Take a look at the Huffman algorithm.

https://codereview.stackexchange.com/questions/44473/huffman-code-implementation

The idea is that each character is replaced with a sequence of bits, depending on its frequency in the text (the more frequent the character, the shorter the sequence).

You can read your entire text and build a table of codes, for example:

Symbol  Code
a       0
s       10
e       110
m       111

The algorithm builds a symbol tree based on the text input. The greater the variety of characters you have, the worse the compression will be.

But depending on your text, it could be effective.
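To show how such a table is used, here is a sketch that encodes and decodes with exactly the four codes from the example table. It skips building the tree, and it keeps the bits as a String of '0' and '1' characters for readability; a real implementation would pack them into bytes. The class name PrefixCodeDemo is made up.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical demo class: encoding with a fixed prefix-code table.
class PrefixCodeDemo {
    static final Map<Character, String> CODES = new HashMap<>();
    static final Map<String, Character> SYMBOLS = new HashMap<>();

    static {
        CODES.put('a', "0");
        CODES.put('s', "10");
        CODES.put('e', "110");
        CODES.put('m', "111");
        for (Map.Entry<Character, String> e : CODES.entrySet()) {
            SYMBOLS.put(e.getValue(), e.getKey());
        }
    }

    static String encode(String text) {
        StringBuilder bits = new StringBuilder();
        for (char c : text.toCharArray()) {
            String code = CODES.get(c);
            if (code == null) {
                throw new IllegalArgumentException("no code for character: " + c);
            }
            bits.append(code);
        }
        return bits.toString();
    }

    static String decode(String bits) {
        StringBuilder out = new StringBuilder();
        StringBuilder current = new StringBuilder();
        for (char bit : bits.toCharArray()) {
            current.append(bit);
            // No code is a prefix of another, so the first match is the right one
            Character symbol = SYMBOLS.get(current.toString());
            if (symbol != null) {
                out.append(symbol);
                current.setLength(0);
            }
        }
        if (current.length() != 0) {
            throw new IllegalArgumentException("trailing bits do not form a code");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String bits = encode("same");
        System.out.println(bits + " (" + bits.length() + " bits) -> " + decode(bits));
    }
}
```

With this table "same" takes 9 bits instead of 32 bits as 8-bit characters, but the decoder needs the same table, which is the overhead that hurts short inputs.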

You don't see any compression for your String because you need at least a couple of hundred bytes to get real compression with GZIPOutputStream or ZipOutputStream. Your String is too small. (I don't understand why you need compression for it.)

Check the conclusion of this article:

The article also shows how to compress and decompress data on the fly in order to reduce network traffic and improve the performance of your client/server applications. Compressing data on the fly, however, improves the performance of client/server applications only when the objects being compressed are more than a couple of hundred bytes. You would not be able to observe improvement in performance if the objects being compressed and transferred are simple String objects, for example.

The Compact Strings enhancement is available out of the box in Java 9: https://openjdk.java.net/jeps/254

java.lang.String now has:

private final byte[] value;

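Notice: the technical posts on this site follow the CC BY-SA 4.0 license. If you repost them, please credit this site's URL or the original address. For any questions, contact: yoyou2525@163.com.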
