
Huffman Code Decoder Encoder In Java Source Generation

Original post: 2015-03-04 09:48:38 7 2 java/ performance/ parsing/ huffman-code

I want to create a fast Huffman code decoder in Java and therefore thought about lookup tables. Since those tables consume memory, and we use Java code to navigate and access them anyway, one can easily (or not) write a program or method that expresses the same table.

The problem with that approach is that I don't know what the best strategy is. I know a lot of it comes down to what fits in the cache and to branch prediction. Also, how a switch/case is actually implemented, meaning the resulting ASM, is beyond me. If I have an in-memory lookup table (or a hierarchy of them) I can simply jump in and out, but I doubt that for my purpose the table would fit in the cache.

Since I actually walk a tree, one could implement it as if/else statements requiring a certain number of comparisons, but each comparison would need additional binary operations.

So the following options exist:

General algorithm using in-memory lookup tables
If/else representation of the decision tree
If/else representation with small switch statements to find the correct group of symbols (same bit pattern length); fewer if statements, but possibly more code
Switch statement representation of the code

Writing and benchmarking all of these is quite tricky, so any initial thoughts would be great.

One additional problem that comes into play is the order of bits. The most significant bit always comes first, meaning the code is stored in reverse order.

If your tree is A = 0, B = 10, C = 11, then to write BAC it would actually be 01 + 0 + 11 (plus means append).

So the codes actually have to be written in reverse order. With the if/else or switch approach for groups that would not be a problem, since masking out the bits is simple and reversing the bits is possible. But it would lose the idea of getting the index within the group out of the mask, since in reverse bit order add and remove have a different meaning, and a simple lookup is not possible either.

Reversing the bits is a costly operation (I use 4-bit lookup tables) that does not outweigh the performance penalty of the binary operations.

But reversing the bits on the go is better suited for this, and requires four operations per bit (shifting up, masking out, adding, and shifting the input down). Since I read bits ahead, all those operations will be done in registers, so they might take only a few cycles.
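The four per-bit operations could look roughly like the following. This is only a sketch; the class and method names are made up for illustration and are not from the original code.

```java
final class BitReverse {
    /** Reverses the lowest n bits of value, handling one bit per iteration. */
    static int reverseLowBits(int value, int n) {
        int result = 0;
        for (int i = 0; i < n; i++) {
            result <<= 1;         // shift the output up
            result |= value & 1;  // mask out the lowest input bit and add it
            value >>>= 1;         // shift the input down
        }
        return result;
    }
}
```

For 1 <= n <= 32 the JDK's `Integer.reverse(value) >>> (32 - n)` yields the same result without a loop, which may be worth benchmarking against the hand-written version.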

This way I can use switch, sub and if to find the right symbol group and also to return those.

So finally I need advice. Since my codes are global for language processing, they can be hardwired (i.e. be in source).

I wonder what parser generators like ANTLR use to express those decisions. Since they also seem to use switch or if/else based on the input symbol, that might give me a clue.

[Updates]

I found a simplification that avoids the reverse bit problem but still adds costs per group. I end up writing the bits in the order of the groups to traverse. So I will not need four modifications per bit but per group (different bit lengths).

For each group we have: 1. The value of the first element and the size (and therefore the value of the last element within that group).

Therefore for each group the algorithm looks like:
1. Read m bits and combine them with the current read value.
2. Compare the value with the last value of that group: if it is smaller, it is within that group; if not, it is outside -> read next.
3. If it is inside the group, an array of values can be accessed, or a switch statement can be used.
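The three steps above can be sketched as follows, assuming a canonical-style code in which all codes of one bit length are consecutive values. The class `GroupDecoder`, its array layout and the '0'/'1' string input are hypothetical and chosen for readability, not speed.

```java
final class GroupDecoder {
    // One entry per group of codes sharing a bit length (hypothetical layout).
    private final int[] bitLength;  // code length of each group, ascending
    private final int[] firstCode;  // smallest code value in the group
    private final int[] count;      // number of codes in the group
    private final int[] offset;     // index of the group's first symbol in symbols
    private final char[] symbols;   // all symbols, ordered group by group

    GroupDecoder(int[] bitLength, int[] firstCode, int[] count, char[] symbols) {
        this.bitLength = bitLength;
        this.firstCode = firstCode;
        this.count = count;
        this.symbols = symbols;
        this.offset = new int[count.length];
        for (int g = 1; g < count.length; g++) {
            offset[g] = offset[g - 1] + count[g - 1];
        }
    }

    /** Decodes a string of '0'/'1' characters, first code bit first. */
    String decode(String bits) {
        StringBuilder out = new StringBuilder();
        int pos = 0;
        while (pos < bits.length()) {
            int value = 0;
            int length = 0;
            boolean found = false;
            for (int g = 0; g < bitLength.length && !found; g++) {
                // 1. Read the m bits that extend the value to this group's length.
                while (length < bitLength[g]) {
                    if (pos == bits.length()) {
                        throw new IllegalArgumentException("truncated input");
                    }
                    value = (value << 1) | (bits.charAt(pos++) - '0');
                    length++;
                }
                // 2. Compare against the group's range. For a canonical code the
                //    upper bound should suffice; the lower bound is a safety check.
                int index = value - firstCode[g];
                if (index >= 0 && index < count[g]) {
                    // 3. Inside the group: plain array access.
                    out.append(symbols[offset[g] + index]);
                    found = true;
                }
            }
            if (!found) {
                throw new IllegalArgumentException("invalid code");
            }
        }
        return out.toString();
    }
}
```

With the tree from the question (A = 0, B = 10, C = 11) the groups are length 1 (first code 0, one symbol) and length 2 (first code 10, two symbols), so `new GroupDecoder(new int[]{1, 2}, new int[]{0b0, 0b10}, new int[]{1, 2}, new char[]{'A', 'B', 'C'}).decode("10011")` yields "BAC". In generated source the loop over groups would be unrolled into one if block per group.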

This is totally generic and can be used without loops, making it efficient. Also, once the group is detected, the bit length of the code is known and the bits can be consumed from the source, since the code looks far ahead (reading from the stream).

[Update 2]

To access the actual value one could use a single big array of elements, ordered group by group. Since the probability decreases from group to group, it is very likely that a significant part fits in the L2 or L1 cache, speeding up access here.

Or one uses switch statements.

[Update 3]

Depending on the cases of a switch, the compiler generates either a tableswitch or a lookupswitch. The lookupswitch has a complexity of O(log n) and stores pairs of key and jump offset, which is not preferable. Therefore checking for groups is better suited to if/else.

The tableswitch itself uses only a table of jump offsets, and it only takes a subtract, compare, access and jmp to reach the destination; then it must execute a return of a constant.

Therefore a table access looks more promising. Also, to avoid an unnecessary jump, each group might contain the logic to access and return from the group's symbol table. Storing everything in one big table is promising since it might be an int or short per symbol, and my codes often have only 1000 to 4000 symbols at most, making it actually a short.
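To make the three variants concrete, here is a small illustrative class (all names are invented for this sketch). With consecutive case labels javac typically emits a tableswitch, with widely spaced labels a lookupswitch; the actual choice can be inspected with `javap -c`.

```java
final class SwitchShapes {
    // Consecutive labels: typically compiled to a tableswitch (jump table).
    static char dense(int index) {
        switch (index) {
            case 0: return 'A';
            case 1: return 'B';
            case 2: return 'C';
            case 3: return 'D';
            default: return '?';
        }
    }

    // Widely spaced labels: typically compiled to a lookupswitch (sorted key/offset pairs).
    static char sparse(int code) {
        switch (code) {
            case 0x0002: return 'A';
            case 0x0130: return 'B';
            case 0x2400: return 'C';
            case 0x7ff0: return 'D';
            default: return '?';
        }
    }

    // Plain array access: a bounds check and a load, no jump to a per-case return.
    private static final char[] TABLE = {'A', 'B', 'C', 'D'};

    static char table(int index) {
        return index >= 0 && index < TABLE.length ? TABLE[index] : '?';
    }
}
```

All three return the same symbols for valid inputs; how they compare in speed after JIT compilation would have to be measured.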

I will check whether the 1 - pattern gives me the opportunity to store and access the masks in a better way, allowing a binary search for the correct group instead of advancing in O(n), and possibly avoiding shift operations entirely during processing.

2 answers

I couldn't make sense of most of what you wrote in your (long) question, but there is a simple approach.

We'll start with a single table. Let's say your longest Huffman code is 15 bits. (In fact, deflate limits the size of its Huffman codes to 15 bits.) Then construct a table with 32768 entries, where each entry is the number of bits in the next code, and the symbol for that code. For codes less than 15 bits, there is more than one entry in the table for the same code. E.g. if the code is 1001011 (7 bits) for the symbol 'C', then all of the table indexes xxxxxxxx1001011 hold the same thing. Those entries all have {7, 'C'}.

Then you get 15 bits from the stream, and look up the next code in the table. You remove the number of bits given by that table entry, and use the resulting symbol. Now you get as many bits from the stream as you need to have 15 again, and repeat. So if you used 7 bits, then get 8 more to get back to 15 and look up the next code.
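A minimal sketch of this single-table scheme follows. The class name, the packing of an entry as `(symbol << 4) | length`, and the convention that the next stream bit sits in bit 0 of the buffer are assumptions made for this example; codes are passed in already in that stream order, so B = 10 from the question becomes 01.

```java
final class SingleTableDecoder {
    private final int maxBits;
    private final int[] table;  // entry = (symbol << 4) | code length; 0 = unused

    /** codes[i] holds the code of symbols[i] with its first stream bit in bit 0. */
    SingleTableDecoder(int maxBits, int[] codes, int[] lengths, char[] symbols) {
        if (maxBits < 1 || maxBits > 15) {
            throw new IllegalArgumentException("length must fit in 4 bits");
        }
        this.maxBits = maxBits;
        this.table = new int[1 << maxBits];
        for (int i = 0; i < codes.length; i++) {
            int entry = (symbols[i] << 4) | lengths[i];
            // Replicate the entry for every combination of the unused high bits.
            for (int index = codes[i]; index < table.length; index += 1 << lengths[i]) {
                table[index] = entry;
            }
        }
    }

    /** Decodes symbolCount symbols from bytes whose bits are consumed lowest bit first. */
    String decode(byte[] input, int symbolCount) {
        StringBuilder out = new StringBuilder(symbolCount);
        long buffer = 0;    // unread bits, next stream bit in bit 0
        int available = 0;  // number of valid bits in buffer
        int next = 0;       // index of the next input byte
        int mask = (1 << maxBits) - 1;
        for (int i = 0; i < symbolCount; i++) {
            // Top the buffer up to at least maxBits bits (zero padding past the end).
            while (available < maxBits) {
                long b = next < input.length ? (input[next++] & 0xFF) : 0;
                buffer |= b << available;
                available += 8;
            }
            int entry = table[(int) buffer & mask];
            int length = entry & 0xF;
            if (length == 0) {
                throw new IllegalArgumentException("invalid code");
            }
            out.append((char) (entry >>> 4));
            buffer >>>= length;  // remove only the bits this code used
            available -= length;
        }
        return out.toString();
    }
}
```

For A = 0, B = 10, C = 11, the stream for BAC is the bits 1,0,0,1,1, which packs into the single byte 0x19, and `new SingleTableDecoder(2, new int[]{0b0, 0b01, 0b11}, new int[]{1, 2, 2}, new char[]{'A', 'B', 'C'}).decode(new byte[]{0x19}, 3)` returns "BAC".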

One subtlety is that if your Huffman code changes often, you might end up spending more time filling up that large table for each new Huffman code than you spend actually decoding. To avoid that, you can make a two-level table which has, say, a 9-bit lookup (512 entries) for the first portion of the code. If the code is 9 bits or less, then you proceed as above. That will be the most common case, since shorter codes are more frequent (that being the whole point of Huffman coding). If the table entry says that there are 10 or more bits in the code (and you don't know yet how many more), then you consume the first nine bits and go to a second-level table for those initial nine bits, pointed to by the entry in the first table, which has entries for the remaining six bits (64 entries). That resolves the remainder of the code and so tells you how many more bits to consume and what the symbol is. This approach can greatly reduce the time spent filling tables, and is very nearly as fast since short codes are more common. This is the approach used by inflate in zlib.
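The two-level idea can be sketched like this. It is an illustration under simplifying assumptions, not zlib's actual layout: every second-level table has the same fixed size, an entry is packed as `(value << 5) | flags`, the next stream bit sits in bit 0 of the buffer, and all names are invented.

```java
import java.util.Arrays;

final class TwoLevelDecoder {
    private static final int LINK = 0x10;  // entry points to a second-level table
    private final int rootBits;
    private final int maxBits;             // assumed to be at most 15
    private final int[] table;             // root table followed by sub-tables

    /** codes[i] holds the code of symbols[i] with its first stream bit in bit 0. */
    TwoLevelDecoder(int rootBits, int maxBits, int[] codes, int[] lengths, char[] symbols) {
        this.rootBits = rootBits;
        this.maxBits = maxBits;
        int rootSize = 1 << rootBits;
        int subSize = 1 << (maxBits - rootBits);
        int[] t = new int[rootSize];
        int used = rootSize;
        for (int i = 0; i < codes.length; i++) {
            int len = lengths[i];
            if (len <= rootBits) {
                // Short code: resolved directly in the root table.
                int entry = (symbols[i] << 5) | len;
                for (int idx = codes[i]; idx < rootSize; idx += 1 << len) {
                    t[idx] = entry;
                }
            } else {
                // Long code: its first rootBits bits select a sub-table.
                int rootIndex = codes[i] & (rootSize - 1);
                if ((t[rootIndex] & LINK) == 0) {
                    t = Arrays.copyOf(t, used + subSize);
                    t[rootIndex] = (used << 5) | LINK;
                    used += subSize;
                }
                int base = t[rootIndex] >>> 5;
                int rest = len - rootBits;  // bits still to consume in the sub-table
                int entry = (symbols[i] << 5) | rest;
                for (int idx = codes[i] >>> rootBits; idx < subSize; idx += 1 << rest) {
                    t[base + idx] = entry;
                }
            }
        }
        this.table = t;
    }

    /** Decodes symbolCount symbols from bytes whose bits are consumed lowest bit first. */
    String decode(byte[] input, int symbolCount) {
        StringBuilder out = new StringBuilder(symbolCount);
        long buffer = 0;
        int available = 0;
        int next = 0;
        int rootMask = (1 << rootBits) - 1;
        int subMask = (1 << (maxBits - rootBits)) - 1;
        for (int i = 0; i < symbolCount; i++) {
            while (available < maxBits) {  // zero padding past the end of input
                long b = next < input.length ? (input[next++] & 0xFF) : 0;
                buffer |= b << available;
                available += 8;
            }
            int entry = table[(int) buffer & rootMask];
            if ((entry & LINK) != 0) {
                buffer >>>= rootBits;      // consume the root bits, then look again
                available -= rootBits;
                entry = table[(entry >>> 5) + ((int) buffer & subMask)];
            }
            int length = entry & 0xF;
            if (length == 0) {
                throw new IllegalArgumentException("invalid code");
            }
            out.append((char) (entry >>> 5));
            buffer >>>= length;
            available -= length;
        }
        return out.toString();
    }
}
```

As a small check, take A = 0, B = 10, C = 110, D = 1110, E = 1111 with a 2-bit root and 4-bit maximum. In stream order the codes are 0, 01, 011, 0111 and 1111, the text EDCBA packs into the bytes 0x7F 0x0B, and decoding those two bytes with a symbol count of 5 gives "EDCBA". Only the codes longer than two bits ever touch the second level.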

In the end it was quite simple. I support almost all solutions now. One can test every symbol group (same bit length), use a lookup table (10 bit + 10 bit + 10 bit; just tables of 10 bits, where symbolscount + 1 is the reference to those tables), and generate Java (and if needed JavaScript, but currently I use GWT to translate it).

I even use long reads and shift operations to reduce the number of accesses to the binary data. This way the code gets more efficient, since I only support a maximum bit size (20 bits, so a table of tables, which allows 2^20 symbols and therefore at most about a million).

For the ordering I use a generator for the bit masks that just uses shift operations, with no need to reverse bit orders or the like.

The table lookups can also be expressed in Java, storing the tables as arrays of arrays (it's interesting how big Java files can get without the compiler complaining).

I also found it interesting that, since comparing expresses an ordering (a partial order, I guess), one can sort the symbols and, instead of mapping to the symbols, map to the comparison index. By comparing two indexes one can simply sort streams of codes without touching too much. By also storing the first one or two comparison indexes (16 or 32 bit), one can efficiently sort, and therefore binary search, compressed strings that use the same Huffman code, which makes it ideal for compressing strings in a certain language.