
Java: Universal BaseN encoder/decoder working with large data sizes

Original post: 2016-11-09 12:44:06. Tags: java, string, converter, encoder, base-n

I'm looking for a decent BaseN encoder (with a custom charset) in Java that is not limited by the input data size (an array of bytes).

Something like this:

https://github.com/mklemm/base-n-codec-java

But for "unlimited" data length, without any unnecessary memory/performance penalty and without "BigInteger abuse magic". Simply something that works like the standard Base64 encoders, but universally for any base/charset. Any solution, or any idea on how to achieve that, is welcome.

Maybe someone has experience with Apache BaseNCodec:

https://commons.apache.org/proper/commons-codec/apidocs/org/apache/commons/codec/binary/BaseNCodec.html

It looked promising; however, it is an abstract class, and judging by the available implementations, writing a new one looks harder than starting from scratch.

I need it for a binary-data-to-custom-character-set encoder, where the number of characters in the set is variable ("ABCDE" = Base5, "ABCDE-+*/." = Base10, ...).

Update: The "Base N Codec" from GitHub (above) seems to be buggy, so in the end I used the following code:

https://dzone.com/articles/base-x-encoding

3 answers

A base-N encoding is quite efficient if N is a power of 2, because conversion can then happen between fixed-size groups of digits and fixed-size groups of bytes.

Base64: 64 = 2⁶, so 6 bits per digit; hence 4 digits = 24 bits = 3 bytes.
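The fixed-size grouping can be sketched for any alphabet whose length is a power of 2. This is only a minimal illustration, not a library implementation: the class name `Pow2Encoder` and its `encode` method are made up for this example, and no padding characters are emitted. It keeps only a few leftover bits between bytes, so memory use does not grow with the input.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of any library.
final class Pow2Encoder {

    /** Encodes data with an alphabet whose length is a power of two. No padding is emitted. */
    static String encode(byte[] data, String alphabet) {
        int n = alphabet.length();
        if (n < 2 || Integer.bitCount(n) != 1) {
            throw new IllegalArgumentException("alphabet length must be a power of two");
        }
        int bitsPerDigit = Integer.numberOfTrailingZeros(n); // e.g. 6 for Base64
        StringBuilder out = new StringBuilder();
        int buffer = 0;   // bits read from the input but not yet emitted
        int bitCount = 0; // number of valid bits in buffer
        for (byte b : data) {
            buffer = (buffer << 8) | (b & 0xFF);
            bitCount += 8;
            while (bitCount >= bitsPerDigit) {
                bitCount -= bitsPerDigit;
                out.append(alphabet.charAt((buffer >>> bitCount) & (n - 1)));
            }
            buffer &= (1 << bitCount) - 1; // keep only the leftover bits
        }
        if (bitCount > 0) {
            // Final partial group: pad with zero bits on the right.
            out.append(alphabet.charAt((buffer << (bitsPerDigit - bitCount)) & (n - 1)));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String base64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
        byte[] man = "Man".getBytes(StandardCharsets.US_ASCII);
        System.out.println(encode(man, base64));                                  // TWFu
        System.out.println(encode(new byte[] {0x0F, (byte) 0xA0}, "0123456789ABCDEF")); // 0FA0
    }
}
```

Because each byte is consumed as soon as it arrives, the same loop could be wrapped around an InputStream instead of an array.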

Otherwise, schoolbook multiplication must be carried out over the entire length, resulting in a lot of "BigInteger"-style calculation.

Somewhat faster than, for instance, repeatedly multiplying/dividing by the base N is keeping an array of powers of N.

To encode a byte array to digits, you could store N⁰, N¹, N², N³, ... as byte arrays of lesser or equal length and do repeated subtractions.

As byte is signed, short might be better suited. Say the highest byte of the number is 98 and the highest byte of the largest power of N that is less than or equal to it is 12; then that digit is roughly 7.

For decoding digits to a byte array, the same powers can be used.
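For comparison, here is a minimal sketch of the simpler variant mentioned above: repeated schoolbook division over the whole byte array instead of a table of powers, without BigInteger. It is not the power-table method itself. The class name `LongDivisionBaseN` is made up for this example, and keeping leading zero bytes as leading first-alphabet characters is an assumption borrowed from Base58-style encoders. Running time is quadratic in the input length, and the whole input must be in memory.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

// Hypothetical helper for illustration; treats the input as one big-endian unsigned number.
final class LongDivisionBaseN {

    static String encode(byte[] input, String alphabet) {
        int base = alphabet.length();
        if (base < 2) throw new IllegalArgumentException("need at least 2 characters");
        int zeros = 0; // leading zero bytes carry no numeric value, so count them separately
        while (zeros < input.length && input[zeros] == 0) zeros++;
        byte[] num = Arrays.copyOf(input, input.length); // working copy, divided in place
        StringBuilder sb = new StringBuilder();
        int start = zeros;
        while (start < num.length) {
            int rem = 0;
            for (int i = start; i < num.length; i++) { // one long division by base
                int cur = (rem << 8) | (num[i] & 0xFF);
                num[i] = (byte) (cur / base);
                rem = cur % base;
            }
            sb.append(alphabet.charAt(rem)); // least significant digit first
            while (start < num.length && num[start] == 0) start++;
        }
        for (int i = 0; i < zeros; i++) sb.append(alphabet.charAt(0));
        return sb.reverse().toString();
    }

    static byte[] decode(String text, String alphabet) {
        int base = alphabet.length();
        if (base < 2) throw new IllegalArgumentException("need at least 2 characters");
        int[] digits = new int[text.length()];
        for (int i = 0; i < digits.length; i++) {
            digits[i] = alphabet.indexOf(text.charAt(i));
            if (digits[i] < 0) throw new IllegalArgumentException("bad character at " + i);
        }
        int zeros = 0;
        while (zeros < digits.length && digits[zeros] == 0) zeros++;
        ByteArrayOutputStream reversed = new ByteArrayOutputStream();
        int start = zeros;
        while (start < digits.length) {
            int rem = 0;
            for (int i = start; i < digits.length; i++) { // one long division by 256
                int cur = rem * base + digits[i];
                digits[i] = cur / 256;
                rem = cur % 256;
            }
            reversed.write(rem); // least significant byte first
            while (start < digits.length && digits[start] == 0) start++;
        }
        for (int i = 0; i < zeros; i++) reversed.write(0);
        byte[] out = reversed.toByteArray();
        for (int i = 0, j = out.length - 1; i < j; i++, j--) { // restore big-endian order
            byte t = out[i]; out[i] = out[j]; out[j] = t;
        }
        return out;
    }

    public static void main(String[] args) {
        // The bytes {1, 0} are the number 256; with "ABCDE-+*/." as digits 0..9 that is "C-+".
        System.out.println(encode(new byte[] {1, 0}, "ABCDE-+*/.")); // C-+
    }
}
```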

Have fun.

General answer: No. Special case: Yes, for bases that are a power of 2.

Why? Because the requirements in the question are in "strong competition" (actually, probably "contradiction").

As input, you want to support an unlimited integer in some base N (think of it as a BigIntegerBaseN). As output, you want to support an unlimited integer in some base M (think of it as a BigIntegerBaseM).
You want to carry out base conversion, which is mathematically defined as a series of (multiplications & additions) and divisions. See http://www.cut-the-knot.org/recurrence/conversion.shtml and https://math.stackexchange.com/questions/48968/how-to-change-from-base-n-to-m .
You want to find a way of calculating such results without doing multiplications and divisions on BigIntegers (in any base of implementation).

Can you determine the results of multiplication and division operations without carrying out multiplication and division calculations? NO. It's a contradiction. When you get the results, by definition, you've carried out the calculation.

So it's not a question of whether you can avoid the calculations, but a question of how to streamline them.

If N and/or M are powers of 2, then multiplication/division can be calculated by simple bit-shifting: the same calculation, with major streamlining. That can be done without BigInteger calculations.
Otherwise, you can cache certain repeated calculations, storing interim results in an array or HashMap; you then get the same calculations with some streamlining. BigInteger calculations are still required, but redundant repetitions are avoided.

Hope that helps your approach. :)

You mention two very different approaches. The BaseN algorithm used in the GitHub implementation uses the mathematical notion of converting an integer between bases. This is equivalent to saying that 10 is the same as 12 in base-8 arithmetic or 1010 in base-2 arithmetic. The algorithm interprets the byte stream as one large number and converts it to the assigned base.
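The "one large number" interpretation can be shown in a few lines. This is only an illustration of the idea, not the code of the GitHub project: the class `RadixDemo` and its `toRadix` method are made up here, and it uses java.math.BigInteger, which is exactly what the question wants to avoid for large inputs. BigInteger.toString(radix) only supports radixes 2 to 36 with a fixed digit set.

```java
import java.math.BigInteger;

// Hypothetical demo class; shows the whole-number view of a byte array.
final class RadixDemo {

    /** Reads the bytes as one big-endian unsigned integer and prints it in the given radix (2..36). */
    static String toRadix(byte[] data, int radix) {
        return new BigInteger(1, data).toString(radix);
    }

    public static void main(String[] args) {
        byte[] ten = {10};
        System.out.println(toRadix(ten, 8));                  // 12
        System.out.println(toRadix(ten, 2));                  // 1010
        System.out.println(toRadix(new byte[] {1, 0}, 10));   // 256
    }
}
```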

Base64 is a very different approach, and you can see an example on the Wikipedia Base64 page. The algorithm basically splits the input stream into an array of elements of 6 bits each. 2^6 = 64, hence the name Base64. It has a table of 64 different characters and maps each (6-bit) element of the array to the corresponding character in that table.

I think you need to select one of the two approaches, since they are very different and not compatible with each other. As for the implementation details, the second method would be easier to implement, I think, since you basically split the stream into fixed-size parts and encode each part according to your own table.
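One way to carry the fixed-size-parts idea over to alphabets that are not a power of 2 is to encode each block of bytes as its own small number, the way Ascii85 maps 4 bytes to 5 digits. The sketch below is an assumption about how that could look, not something the answer spells out: the class `BlockBaseN`, the 4-byte block size, and the limit of 2 to 256 characters are choices made for this example. The output differs from a whole-number conversion and is slightly longer than the theoretical minimum, but only `long` arithmetic is needed and each block is independent of the input length.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical block codec for illustration; alphabet length must be 2..256.
final class BlockBaseN {
    static final int BLOCK = 4; // bytes per block (a choice made for this sketch)

    /** Smallest digit count m such that base^m >= 256^byteCount. */
    static int digitsFor(int byteCount, int base) {
        long limit = 1L << (8 * byteCount);
        long pow = 1;
        int m = 0;
        while (pow < limit) { pow *= base; m++; }
        return m;
    }

    static int checkBase(String alphabet) {
        int base = alphabet.length();
        if (base < 2 || base > 256) throw new IllegalArgumentException("alphabet length must be 2..256");
        return base;
    }

    static String encode(byte[] data, String alphabet) {
        int base = checkBase(alphabet);
        StringBuilder out = new StringBuilder();
        for (int off = 0; off < data.length; off += BLOCK) {
            int len = Math.min(BLOCK, data.length - off); // the last block may be shorter
            long value = 0;
            for (int i = 0; i < len; i++) value = (value << 8) | (data[off + i] & 0xFF);
            char[] digits = new char[digitsFor(len, base)];
            for (int i = digits.length - 1; i >= 0; i--) {
                digits[i] = alphabet.charAt((int) (value % base));
                value /= base;
            }
            out.append(digits);
        }
        return out.toString();
    }

    static byte[] decode(String text, String alphabet) {
        int base = checkBase(alphabet);
        int full = digitsFor(BLOCK, base);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int off = 0; off < text.length(); off += full) {
            int m = Math.min(full, text.length() - off);
            int len = 1; // find the byte count that produces m digits
            while (len <= BLOCK && digitsFor(len, base) != m) len++;
            if (len > BLOCK) throw new IllegalArgumentException("invalid input length");
            long value = 0;
            for (int i = 0; i < m; i++) {
                int d = alphabet.indexOf(text.charAt(off + i));
                if (d < 0) throw new IllegalArgumentException("bad character at " + (off + i));
                value = value * base + d;
            }
            if (value >= (1L << (8 * len))) throw new IllegalArgumentException("block out of range");
            for (int i = len - 1; i >= 0; i--) out.write((int) (value >>> (8 * i)));
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Base 10: each 4-byte block becomes 10 digits, a trailing single byte becomes 3.
        System.out.println(encode(new byte[] {1, 2, 3, 4, 5}, "0123456789")); // 0016909060005
    }
}
```

With at most 256 characters, a shorter final block always yields fewer digits than a longer one, which is what lets `decode` recover the length of the last block from the number of trailing digits.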

The first method can get quite complicated, since arbitrary-precision arithmetic operations rely on quite complex constructs. You can have a look at existing software, again on Wikipedia's list of arbitrary-precision arithmetic software.

Realistically, I think at some point you will find it hard to find enough characters for your conversions (as the base goes up or the number of bits goes up), unless you use the whole Unicode alphabet :).

Hope I helped a bit.