
Why do we need unsigned char for Huffman tree code

I am trying to build a Huffman tree, and the exercise I was given seems strange to me. It reads as follows:

Given the following data structure:

struct huffman {
    unsigned char sym;            /* symbol */
    struct huffman *left, *right; /* left and right subtrees */
};

Write a program that takes the name of a binary file as its sole argument, builds the Huffman tree of that file assuming that atoms (elementary symbols) are 8-bit unsigned characters, and prints the tree as well as the dictionary. Allocations must be done using nothing other than malloc(), and sorting can be done using qsort().

What confuses me is that, to write a program that builds a Huffman tree, we only need to do the following:

  1. Take a frequency array (that could be Farray[]={.......}).
  2. Sort it and merge the two smallest nodes into a new node, repeating until only one final node is left (the head of the tree).

Now my question is: why and where do we need this unsigned char data? What kind of unsigned char data does the exercise want? I would have thought the frequencies alone are enough to display a Huffman tree.

If you purely want to display the shape of the tree, then yes, you just need to build it. However, for it to be of any use whatsoever, you need to know which original symbol each leaf node represents.

Imagine your input symbols are [ABCD]. An imaginary Huffman tree/dictionary might look like this:

         ( )
        /   \              A = 1
      ( )   (A)            B = 00
     /   \                 C = 010
   (B)   ( )               D = 011
        /   \
      (C)   (D)

If you don't store sym, it looks like this:

         ( )
        /   \              A = ?
      ( )   ( )            B = ?
     /   \                 C = ?
   ( )   ( )               D = ?
        /   \
      ( )   ( )

Not very useful, that, is it?

Edit 2: The missing step in the plan is step 0: build the frequency array from the file (somehow I missed that you don't need to actually encode the file too). This isn't part of the actual Huffman algorithm itself and I couldn't find a decent example to link to, so here's a rough idea:

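/* NOTE: real code should check that fopen() did not return NULL */
FILE *input = fopen("inputfile", "rb");  /* binary mode: count raw bytes */
int freq[256] = {0};                     /* one counter per 8-bit symbol */
int c;                                   /* int, not char, so EOF stays distinct */
while ((c = fgetc(input)) != EOF)
    freq[c]++;                           /* fgetc() yields the byte as an unsigned char value */
fclose(input);

/* do Huffman algorithm  */
...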

Now, that still needs improving, since it neither uses malloc() nor takes a filename as an argument, but it's not my homework ;)

It's been a while since I did this, but I think the generated "dictionary" is required to encode data, while the "tree" is used to decode it. Of course, you can always build one from the other.

While decoding, you traverse the tree (left or right, according to successive input bits), and when you hit a terminal node (one whose child pointers are null), the 'sym' in that node is the output value.

Usually data compression is divided into two big steps. Given a stream of data:

  • Evaluate the probability that a given symbol will appear in the stream; in other words, measure how frequently each symbol appears in the dataset.
  • Once you have studied the occurrences and built a table of symbols with their associated probabilities, encode the symbols according to those probabilities. To do this you create a dictionary in which the original symbol is often replaced with another symbol that is much smaller in size, especially for symbols that are used frequently in the dataset. The dictionary keeps track of these substitutions for both the encoding and the decoding phase.

Huffman gives you an algorithm to automate this process and get a fairly good result.

In practice it's a little more complicated than this, because trees are involved, but the main purpose is always to build the dictionary.

There is a complete tutorial here.


 