How efficient is the encoding/decoding algorithm of the Base64 class in Java?

I am about to use an algorithm to encode a variable-length but very long String field retrieved from an XML file; the encoded data should then be persisted in the database.

Later, when I receive a second file, I need to fetch the previously stored encoded data from the database, decode it, and compare it with the new data to check for duplicates.

I tried the org.apache.commons.codec.binary.Base64 class. It has two methods:

  1. encodeBase64(byte[] barray)
  2. decodeBase64(String str)

Both work perfectly fine and solve my problem. But it converts a 55-character string into a String of just 6 characters.

So I wonder whether there is any case where this algorithm encodes two very large Strings that differ by only one character (for example) into the same encoded byte array.

I don't know much about the Base64 class, so if anyone can help me out it would be really helpful.

If you can suggest any other algorithm that shortens a large String to a fixed length and serves my purpose, I will be happy to use it.

Thanks in advance.

Not very efficient.

Also, using sun.misc classes makes the application non-portable.

Check out the following performance comparisons from MiGBase64:

[image: MiGBase64 performance comparison]


> So I wonder if there is any case where these algorithm encodes 2 Strings which are very large and have only 1 char mismatch (for example) into same encoded byte arrays.

Base64 isn't a hashing algorithm; it's an encoding and must therefore be bi-directional. Collisions cannot be allowed, by necessity: otherwise decoding would be non-deterministic. Base64 is designed to represent arbitrary binary data in an ASCII string. Encoding a Unicode string as Base64 will usually increase the number of characters required, because Base64 emits four ASCII characters for every three input bytes and a single Unicode character may occupy several bytes. The Base64 representation of a Unicode string will also vary depending on the character encoding (UTF-8, UTF-16) used. For example:

Base64( UTF8( "test" ) ) => "dGVzdA=="
Base64( UTF16( "test" ) ) => "/v8AdABlAHMAdA=="
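
The two results above can be reproduced with a minimal sketch like the one below. It assumes the JDK's own java.util.Base64 (Java 8 or later) rather than the Commons Codec class from the question, and the class and method names (Base64CharsetDemo, base64OfUtf8, and so on) are made up for illustration. It also checks that two long inputs differing in a single character do not encode to the same text.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

// Hypothetical demo class; not part of any library.
class Base64CharsetDemo {

    // Base64-encode the UTF-8 bytes of a string.
    static String base64OfUtf8(String s) {
        return Base64.getEncoder().encodeToString(s.getBytes(StandardCharsets.UTF_8));
    }

    // Base64-encode the UTF-16 bytes of a string. Java's "UTF-16" charset
    // writes a big-endian byte order mark (0xFE 0xFF) before the text.
    static String base64OfUtf16(String s) {
        return Base64.getEncoder().encodeToString(s.getBytes(StandardCharsets.UTF_16));
    }

    // Decode Base64 text that was produced from UTF-8 bytes.
    static String decodeUtf8(String base64) {
        return new String(Base64.getDecoder().decode(base64), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(base64OfUtf8("test"));   // dGVzdA==
        System.out.println(base64OfUtf16("test"));  // /v8AdABlAHMAdA==

        // Round trip: decoding returns the original string.
        System.out.println(decodeUtf8(base64OfUtf8("test"))); // test

        // Two 1001-character strings that differ only in the last character.
        char[] filler = new char[1000];
        Arrays.fill(filler, 'x');
        String first = new String(filler) + "a";
        String second = new String(filler) + "b";
        System.out.println(base64OfUtf8(first).equals(base64OfUtf8(second))); // false
    }
}
```

Note that the encoded form of the 1001-character input is longer than the input, not shorter: Base64 never compresses.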

Solution 1

Use lossless compression

GZip( UTF8( "test" ) )

Here you are converting the string to a byte array and using lossless compression to reduce the number of bytes you have to store. You can vary the character encoding and the compression algorithm to reduce the number of bytes, depending on the Strings you will be storing (i.e. if the text is mostly ASCII then UTF-8 will probably be best).

Pros: no collisions; the original string can be recovered
Cons: the number of bytes required to store the value is variable, and comparatively large
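
As a rough sketch of Solution 1, the following uses the JDK's GZIPOutputStream and GZIPInputStream on the UTF-8 bytes of a string. The class name GzipStringDemo and the sample text are invented for illustration, and checked I/O exceptions are wrapped so the helpers are easy to call.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class; not part of any library.
class GzipStringDemo {

    // GZip( UTF8( text ) ): compress the UTF-8 bytes of the string.
    static byte[] compress(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Reverse of compress: gunzip the bytes and decode them as UTF-8.
    static String decompress(byte[] compressed) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] chunk = new byte[4096];
            int read;
            while ((read = gzip.read(chunk)) != -1) {
                buffer.write(chunk, 0, read);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Repetitive, XML-like sample text (made up for this example).
        StringBuilder sample = new StringBuilder();
        for (int i = 0; i < 200; i++) {
            sample.append("<item>value</item>");
        }
        String original = sample.toString();

        byte[] stored = compress(original);
        System.out.println("original bytes:   " + original.getBytes(StandardCharsets.UTF_8).length);
        System.out.println("compressed bytes: " + stored.length);
        System.out.println("round trip ok:    " + decompress(stored).equals(original));
    }
}
```

The compressed size depends entirely on the input; very short or high-entropy strings can come out larger than the original because of the GZIP header and trailer.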

Solution 2

Use a hashing algorithm

SHA256( UTF8( "test" ) )

Here you are converting the string to a fixed-length set of bytes with a hashing function. Hashing is uni-directional, and by its nature collisions are possible. However, based on the profile and number of Strings that you expect to process, you can select a hash function that minimises the likelihood of collisions.

Pros: the number of bytes required to store the value is fixed and small
Cons: collisions are possible; the original string cannot be recovered
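
A minimal sketch of Solution 2, assuming the JDK's MessageDigest with the "SHA-256" algorithm and a hex rendering of the 32-byte digest. The class name Sha256Demo and the helper sha256Hex are hypothetical names chosen for this example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical demo class; not part of any library.
class Sha256Demo {

    // SHA256( UTF8( text ) ), rendered as 64 lowercase hex characters.
    static String sha256Hex(String text) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every compliant Java platform is required to provide SHA-256.
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    public static void main(String[] args) {
        // The digest length is the same regardless of the input length.
        System.out.println(sha256Hex("test").length());  // 64

        // Duplicate check: compare the stored digest with the digest of the new value.
        String storedDigest = sha256Hex("some very long field value");
        String incoming = "some very long field value";
        System.out.println(storedDigest.equals(sha256Hex(incoming)));  // true
    }
}
```

For the duplicate check described in the question, only the digest needs to be stored and compared; the original text cannot be recovered from it.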

I just saw your comment - it seems you're actually looking for compression rather than hashing, as I initially thought. In that case, though, you won't be able to get fixed-length output for arbitrary input (think about it: an infinite number of inputs cannot map bijectively to a finite number of outputs), so I hope that wasn't a strong requirement.

In any case, the performance of your chosen compression algorithm will depend on the characteristics of the input text. In the absence of further information, DEFLATE compression (as used by the Zip input streams, IIRC) is a good general-purpose algorithm to start with, or at least to use as a basis for comparison. For ease of implementation, you can use the Deflater class built into the JDK, which uses ZLib compression.
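
The following is one possible way to use java.util.zip.Deflater and its counterpart Inflater for this, offered as a sketch rather than a tuned implementation. The class name DeflaterDemo, the 4096-byte buffer, and the choice of BEST_COMPRESSION are assumptions made for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical demo class; not part of any library.
class DeflaterDemo {

    // Compress the UTF-8 bytes of the text in ZLIB format.
    static byte[] deflate(String text) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(text.getBytes(StandardCharsets.UTF_8));
        deflater.finish(); // no more input will follow
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        while (!deflater.finished()) {
            int written = deflater.deflate(chunk);
            out.write(chunk, 0, written);
        }
        deflater.end(); // release native resources
        return out.toByteArray();
    }

    // Decompress ZLIB data and decode the result as UTF-8.
    static String inflate(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        try {
            while (!inflater.finished()) {
                int read = inflater.inflate(chunk);
                if (read == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    // Avoid looping forever on truncated or unsupported data.
                    throw new IllegalArgumentException("Incomplete or unsupported ZLIB data");
                }
                out.write(chunk, 0, read);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("Not valid ZLIB data", e);
        } finally {
            inflater.end();
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "a long field value, a long field value, a long field value";
        byte[] stored = deflate(original);
        System.out.println(inflate(stored).equals(original)); // true
    }
}
```

The resulting byte array can be stored in a binary column as-is, or Base64-encoded first if the database column only accepts text.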

If your input strings have particular patterns, then different compression algorithms may be more or less efficient. In one respect it doesn't matter which one you use if you don't intend the compressed data to be read by any other processes: so long as you can compress and decompress it yourself, the choice will be transparent to your clients.
