Huffman Coding - Dealing with Unicode

Question

I've implemented Huffman coding in Java that works on byte data from an input file. However, it only works when compressing ASCII. I'd like to extend it so that it can deal with characters that are longer than 1 byte, but I'm not sure exactly how to do this.

private static final int CHARS = 256;     
private int [] getByteFrequency(File f) throws FileNotFoundException {
    try {
        FileInputStream fis = new FileInputStream(f);
        byte [] bb = new byte[(int) f.length()];
        int [] aa = new int[CHARS];
            if(fis.read(bb) == bb.length) {
                System.out.print("Uncompressed data: ");
                for(int i = 0; i < bb.length; i++) {
                        System.out.print((char) bb[i]);
                        aa[bb[i]]++;
                }
                System.out.println();
            }
        return aa;
    } catch (FileNotFoundException e) { throw new FileNotFoundException(); 
    } catch (IOException e) { e.printStackTrace(); }
    return null;
}

For example, this is what I'm using to get the frequency of the characters in the file, and obviously it only works on single bytes. If I give it a Unicode file, I get an ArrayIndexOutOfBoundsException at aa[bb[i]]++; and the offending index is normally a negative number. I know this is because aa[bb[i]]++; is only looking at one byte, and a Unicode character can be more than one, but I'm not sure how to change it.

Can anybody give me some pointers?

Answer 1

Try the following:

import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;

private static final int CHARS = 256;

private int[] getByteFrequency(File f) throws FileNotFoundException {
    // try-with-resources closes the stream even if reading fails
    try (FileInputStream fis = new FileInputStream(f)) {
        byte[] bb = new byte[(int) f.length()];
        int[] aa = new int[CHARS];
        if (fis.read(bb) == bb.length) {
            System.out.print("Uncompressed data: ");
            for (int i = 0; i < bb.length; i++) {
                System.out.print((char) bb[i]);
                // byte is signed (-128..127); masking gives an index in 0..255
                aa[bb[i] & 0xff]++;
            }
            System.out.println();
        }
        return aa;
    } catch (FileNotFoundException e) {
        throw e; // rethrow the original so its message is preserved
    } catch (IOException e) {
        e.printStackTrace();
    }
    return null;
}

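If I'm correct (I haven't tested it), your problem is that byte is a signed type in Java, with a range of -128 to 127. Any byte with a value of 0x80 or above, which is what the bytes of non-ASCII characters look like, is therefore read as a negative number and used as a negative array index. Widening the byte to int and masking it with 0xff maps it back to the range 0 to 255, which should handle it correctly. The coder still works on individual bytes, but since every byte value now has a valid slot in the frequency table, multi-byte characters should no longer be a problem.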
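
To see the effect of the mask in isolation, here is a minimal sketch. The class name ByteFrequencyDemo and the helper countBytes are made up for illustration and are not part of the original code; the input is assumed to be the UTF-8 encoding of a single "é" (U+00E9), which is the two bytes 0xC3 0xA9.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the asker's program.
class ByteFrequencyDemo {
    static final int CHARS = 256;

    // Counts how often each byte value (0..255) occurs in data.
    static int[] countBytes(byte[] data) {
        int[] freq = new int[CHARS];
        for (byte b : data) {
            // b is sign-extended to int; & 0xff keeps only the low 8 bits
            freq[b & 0xff]++;
        }
        return freq;
    }

    public static void main(String[] args) {
        // "\u00e9" is 'é'; in UTF-8 it is encoded as the bytes 0xC3 0xA9
        byte[] data = "\u00e9".getBytes(StandardCharsets.UTF_8);
        for (byte b : data) {
            System.out.println(b + " -> " + (b & 0xff));
        }
        // prints:
        // -61 -> 195
        // -87 -> 169
        int[] freq = countBytes(data);
        System.out.println(freq[0xC3] + " " + freq[0xA9]); // prints "1 1"
    }
}
```

Without the mask, the first byte would be used as index -61, which is exactly the kind of negative index that triggers the ArrayIndexOutOfBoundsException.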
