
Huffman Coding - Dealing with Unicode

I've implemented Huffman coding in Java that works on byte data from an input file. However, it only works when compressing ASCII. I'd like to extend it so that it can deal with characters that are longer than 1 byte, but I'm not sure exactly how to do this.

private static final int CHARS = 256;     
private int [] getByteFrequency(File f) throws FileNotFoundException {
    try {
        FileInputStream fis = new FileInputStream(f);
        byte [] bb = new byte[(int) f.length()];
        int [] aa = new int[CHARS];
            if(fis.read(bb) == bb.length) {
                System.out.print("Uncompressed data: ");
                for(int i = 0; i < bb.length; i++) {
                        System.out.print((char) bb[i]);
                        aa[bb[i]]++;
                }
                System.out.println();
            }
        return aa;
    } catch (FileNotFoundException e) { throw new FileNotFoundException(); 
    } catch (IOException e) { e.printStackTrace(); }
    return null;
}

For example, this is what I'm using to get the frequency of the characters in the file, and obviously it only works on single bytes. If I give it a Unicode file, I get an ArrayIndexOutOfBoundsException at aa[bb[i]]++; and the index is normally a negative number. I believe this is because aa[bb[i]]++; is only looking at one byte, and a Unicode character will be more than one, but I'm not sure how I can change it.

Can anybody give me some pointers?

Try the following:

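// Imports needed at the top of the file
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;

private static final int CHARS = 256;

private int[] getByteFrequency(File f) throws FileNotFoundException {
    // try-with-resources closes the stream even if read() throws
    try (FileInputStream fis = new FileInputStream(f)) {
        byte[] bb = new byte[(int) f.length()];
        int[] aa = new int[CHARS];
        if (fis.read(bb) == bb.length) {
            System.out.print("Uncompressed data: ");
            for (int i = 0; i < bb.length; i++) {
                System.out.print((char) bb[i]);
                // byte is signed (-128..127); masking with 0xff maps it to 0..255
                aa[bb[i] & 0xff]++;
            }
            System.out.println();
        }
        return aa;
    } catch (FileNotFoundException e) {
        throw e; // rethrow the original so its message is preserved
    } catch (IOException e) {
        e.printStackTrace();
    }
    return null;
}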

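If I'm correct (I haven't tested it), your problem is that byte is a signed type in Java, with a range of -128 to 127. Any byte with a value of 0x80 or above (which is every byte of a multi-byte UTF-8 character) is read as a negative number, and that negative number is what ends up as the array index. Converting the byte to int and masking it with 0xff gives an index in the range 0 to 255, so the frequency table handles it correctly. Since the Huffman coder works on bytes, a multi-byte character needs no special handling: each of its bytes is simply counted on its own.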
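
To make the sign issue concrete, here is a small standalone sketch. The class name ByteFrequencyDemo and the helper countBytes are made up for illustration and are not part of the code above. It assumes UTF-8 input and counts the two bytes of a single non-ASCII character.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original code.
public class ByteFrequencyDemo {

    // Counts how often each of the 256 possible byte values occurs.
    public static int[] countBytes(byte[] data) {
        int[] freq = new int[256];
        for (byte b : data) {
            // b is sign-extended when widened to int;
            // & 0xff keeps only the low 8 bits, giving 0..255
            freq[b & 0xff]++;
        }
        return freq;
    }

    public static void main(String[] args) {
        // U+00E9 (e with acute accent) is encoded in UTF-8 as 0xC3 0xA9
        byte[] data = "\u00e9".getBytes(StandardCharsets.UTF_8);
        for (byte b : data) {
            // Prints -61 / 195 for the first byte and -87 / 169 for the second
            System.out.println("signed: " + b + ", masked: " + (b & 0xff));
        }
        int[] freq = countBytes(data);
        System.out.println("count of 0xC3: " + freq[0xC3]);
        System.out.println("count of 0xA9: " + freq[0xA9]);
    }
}
```

Using the signed values (-61 and -87) directly as array indices is what throws the ArrayIndexOutOfBoundsException. With the mask, both bytes land in the 256-entry table like any ASCII byte would.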
