
How to read a binary file to calculate the frequencies for a Huffman tree?

I have to calculate the symbol frequencies for a Huffman tree from a "binary file" that is passed as the sole argument. My understanding is that binary files are files that contain only "0" and "1".

Frequency, on the other hand, is the number of times each letter occurs (e.g., in abbacdd the frequencies are a=2, b=2, c=1, d=2). My structure must look like this:

struct Node
{
    unsigned char symbol;      /* the symbol (letter) */
    int freq;                  /* its frequency */
    struct Node *left, *right; /* left and right children */
};

But I do not understand at all how I can get the symbols from a ".bin" file (which consists of only "0" and "1").

When I try to view the contents of the file, I get:

hp@ubuntu:~/Desktop/Internship_Xav/Huf_pointer$ xxd -b out.bin 
0000000: 00000000 00000000 00000000 00000000 00000000 00000000  ......
0000006: 00000000 00000000 00000000 00000000 00000000 00000000  ......
000000c: 00000000 00000000 00000000 00000000 00000000 00000000  ......
0000012: 00000000 00000000 00000000 00000000 00000000 00000000  ......
0000018: 00000000 00000000 00000000 00000000 00000000 00000000  ......
000001e: 00000000 00000000 00000000 00000000 00000000 00000000  ......
0000024: 00000000 00000000 00000000 00000000 00000000 00000000  ......
000002a: 00000000 00000000 00000000 00000000 00000000 00000000  ......
0000030: 00000000 00000000 00000000 00000000 00000000 00000000  ......
.........//Here also there is similar kind of data    ................
00008ca: 00010011 00010011 00010011 00010011 00010011 00010011  ......
00008d0: 00010011 00010011 00010011 00010011 00010011 00010011  ......
00008d6: 00010011 00010011 00010011 00010011 00010011 00010011  ..... 

So I do not understand where the frequencies are and where the symbols are, how to store the symbols, or how to calculate the frequencies. Once I have the frequencies and symbols, I will use them to create the Huffman tree.

First, you need to create some sort of frequency table.
You could use a std::map.
You would do something like this:

#include <algorithm>
#include <fstream>
#include <map>
#include <string>

std::map <unsigned char, int> CreateFrequencyTable (const std::string &strFile)
{
    std::map <unsigned char, int> char_freqs ; // character frequencies

    // Open in binary mode so that no byte is translated on the way in.
    std::ifstream file (strFile, std::ios::binary) ;

    char next = 0 ;
    while (file.get (next)) {
        unsigned char uc = static_cast <unsigned char> (next) ;

        // operator[] inserts a zero count the first time a byte is seen.
        char_freqs [uc] += 1 ;
    }

    return char_freqs ;
}

Then you could use this function like this:

std::map <unsigned char, int> char_freqs = CreateFrequencyTable ("file") ;

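You can obtain the element with the highest frequency like this:

// Compare by frequency (the mapped value), not by key.
std::map <unsigned char, int>::iterator iter = std::max_element (
    char_freqs.begin (),
    char_freqs.end (),
    [] (const std::pair <const unsigned char, int> &a,
        const std::pair <const unsigned char, int> &b) {
        return a.second < b.second ;
    }
) ;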

Then you would need to build your Huffman tree.
Remember that the characters are all leaf nodes, so you need a way to differentiate the leaf nodes from the non-leaf nodes.
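
As a rough sketch of that step (not the only way to do it), the tree could be built with a std::priority_queue over the asker's Node struct. The names IsLeaf, HigherFreq, BuildHuffmanTree and FreeTree are made up for this example, and it assumes a leaf can be recognised by having two null children:

```cpp
#include <map>
#include <queue>
#include <vector>

struct Node
{
    unsigned char symbol ;   // meaningful only for leaf nodes
    int freq ;               // frequency of the symbol or of the whole subtree
    Node *left, *right ;     // both null for a leaf
} ;

// Assumption for this sketch: a node is a leaf exactly when it has no children.
bool IsLeaf (const Node *n)
{
    return n->left == nullptr && n->right == nullptr ;
}

// Comparator that puts the node with the lowest frequency on top of the queue.
struct HigherFreq
{
    bool operator() (const Node *a, const Node *b) const
    {
        return a->freq > b->freq ;
    }
} ;

// Returns the root, or nullptr for an empty table. Release it with FreeTree.
Node *BuildHuffmanTree (const std::map <unsigned char, int> &char_freqs)
{
    std::priority_queue <Node *, std::vector <Node *>, HigherFreq> pq ;

    // Every symbol starts out as a leaf.
    for (const auto &entry : char_freqs) {
        pq.push (new Node {entry.first, entry.second, nullptr, nullptr}) ;
    }

    // Repeatedly merge the two least frequent nodes under a new internal node.
    while (pq.size () > 1) {
        Node *a = pq.top () ; pq.pop () ;
        Node *b = pq.top () ; pq.pop () ;
        pq.push (new Node {0, a->freq + b->freq, a, b}) ; // symbol unused here
    }

    return pq.empty () ? nullptr : pq.top () ;
}

void FreeTree (Node *n)
{
    if (n == nullptr) return ;
    FreeTree (n->left) ;
    FreeTree (n->right) ;
    delete n ;
}
```

The frequency stored in the root equals the total number of bytes counted, which makes a handy sanity check.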

Update

If reading a single character at a time from the file is too slow, you could always load the entire contents into a vector like this:

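// Make sure to #include <fstream>, <iterator> and <vector>.
// Open in binary mode; istreambuf_iterator reads raw bytes and, unlike
// istream_iterator, does not skip whitespace.
std::ifstream file ("test.txt", std::ios::binary) ;
std::vector <unsigned char> vecBuffer (
    (std::istreambuf_iterator <char> (file)),
    std::istreambuf_iterator <char> ()
) ;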

You would still need to create a frequency table.

A symbol in a Huffman tree could be anything, but since you have to use an unsigned char per symbol, you should probably take a byte as the symbol. So no, a symbol is not a single 0 or 1, but eight 0s or 1s together.

For example, 00010011 appears somewhere in your xxd output. xxd -b simply prints each byte as eight 0/1 digits. The same byte could just as well be written as a number between 0 and 255, or as two characters from 0123456789abcdef. There are many ways to show a byte on the screen, but that does not matter at all.
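
To make that concrete, here is a small sketch showing one byte value in the three notations mentioned. The helper names ToBinary and ToHex are invented for this example:

```cpp
#include <bitset>
#include <iomanip>
#include <sstream>
#include <string>

// Eight 0/1 digits, the way xxd -b prints a byte.
std::string ToBinary (unsigned char byte)
{
    return std::bitset <8> (byte).to_string () ;
}

// Two hexadecimal digits, the way plain xxd prints a byte.
std::string ToHex (unsigned char byte)
{
    std::ostringstream out ;
    out << std::hex << std::setw (2) << std::setfill ('0')
        << static_cast <int> (byte) ;
    return out.str () ;
}
```

For the byte from the dump, ToBinary (19) gives "00010011", ToHex (19) gives "13", and std::to_string (19) gives "19". All three describe the same symbol.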

If you know how to read the contents of a file in C/C++, just read unsigned chars until the file ends and count how often each value occurs. That's all.

Since you are probably writing decimal numbers in your program code: there are 256 different values (0, 1, 2, ..., 255). So you will need 256 integers (in an array, or in your Node structs...) to count how often each value appears.
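
A minimal sketch of the array variant might look like this. The function name CountBytes and the choice of long long for the counters are assumptions made for the example, not part of the question:

```cpp
#include <array>
#include <fstream>
#include <string>

// Hypothetical helper: counts [v] is how often the byte value v occurs.
std::array <long long, 256> CountBytes (const std::string &path)
{
    std::array <long long, 256> counts {} ;   // all 256 counters start at zero

    // Binary mode, so every byte reaches the loop unchanged.
    std::ifstream file (path, std::ios::binary) ;

    char c ;
    while (file.get (c)) {
        // Convert to unsigned char so the index is always in 0..255.
        ++counts [static_cast <unsigned char> (c)] ;
    }

    return counts ;
}
```

Every index with a non-zero count is a symbol that occurs in the file, and the count is its frequency. If the file cannot be opened, this sketch simply returns all zeros.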
