
Reading a single UTF-8 character with RandomAccessFile

I've set up a sequential scanner, where a RandomAccessFile pointing to my file is able to read a single character, via the below method:

public char nextChar() {
    try {
        seekPointer++;
        int i = source.read();
        return i > -1 ? (char) i : '\0'; // INFO: EOF character is -1.
    } catch (IOException e) {
        e.printStackTrace();
    }
    return '\0';
}

The seekPointer is just a reference for my program. The method stores source.read() in an int and then returns it cast to a char if it's not the end of the file. But the chars I'm receiving are effectively limited to ASCII; in fact, it's so bad that I can't even read a symbol such as ç .

Is there a way that I can receive a single character that is in UTF-8 format, or at least something standardised that allows more than just the ASCII character set?

I know I can use readUTF() but that returns an entire line as a String, which is not what I am after.

Also, I can't simply use another stream reader, because my program requires a seek(int) function, allowing me to move back and forth in the file.

I'm not entirely sure what you're trying to do, but let me give you some information that might help.

The UTF-8 encoding represents characters as either 1, 2, 3, or 4 bytes depending on the Unicode value of the character.

  • For characters 0x00-0x7F, UTF-8 encodes the character as a single byte. This is a very useful property because if you're only dealing with 7-bit ASCII characters, the UTF-8 and ASCII encodings are identical.
  • For characters 0x80-0x7FF, UTF-8 uses 2 bytes: the first byte is binary 110 followed by the 5 high bits of the character, while the second byte is binary 10 followed by the 6 low bits of the character.
  • The 3- and 4-byte encodings are similar to the 2-byte encoding, except that the first byte of the 3-byte encoding starts with 1110 and the first byte of the 4-byte encoding starts with 11110.
  • See Wikipedia for all the details.
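To make the bit layouts above concrete, here is a minimal sketch that encodes a single code point by hand and compares the result with what the JDK produces. The class and method names (Utf8EncodeSketch, encode) are made up for illustration, and the sketch assumes a valid code point that is not a lone surrogate.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class Utf8EncodeSketch {
    // Encodes one Unicode code point using the 1- to 4-byte layouts described above.
    // Assumption: cp is a valid code point and not a surrogate (no validation here).
    static byte[] encode(int cp) {
        if (cp < 0x80) {
            return new byte[] { (byte) cp };                 // 0xxxxxxx
        } else if (cp < 0x800) {
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),                   // 110xxxxx: 5 high bits
                (byte) (0x80 | (cp & 0x3F)) };               // 10xxxxxx: 6 low bits
        } else if (cp < 0x10000) {
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),                  // 1110xxxx
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
        } else {
            return new byte[] {
                (byte) (0xF0 | (cp >> 18)),                  // 11110xxx
                (byte) (0x80 | ((cp >> 12) & 0x3F)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
        }
    }

    public static void main(String[] args) {
        byte[] manual = encode(0x00E7);                      // U+00E7 is 'ç'
        byte[] library = "\u00E7".getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(manual, library));  // prints "true"
        System.out.printf("%02X %02X%n", manual[0] & 0xFF, manual[1] & 0xFF); // prints "C3 A7"
    }
}
```

So the ç from the question is the two bytes C3 A7 on disk, which is why a single read() can never return it.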

Now this may seem pretty byzantine but the upshot of it is this: you can read any byte in a UTF-8 file and know whether you're looking at a standalone character, the first byte of a multibyte character, or one of the other bytes of a multibyte character.

If the byte you read starts with binary 0, you're looking at a single-byte character. If it starts with 110, 1110, or 11110, then you have the first byte of a multibyte character of 2, 3, or 4 bytes, respectively. If it starts with 10, then it's one of the subsequent bytes of a multibyte character; scan backwards to find the start of it.
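The rule in the previous paragraph can be sketched as a small classifier. The name Utf8ByteKind.classify and its return convention are assumptions chosen for this example, and the argument is expected to be an unsigned byte value (0 to 255), as returned by RandomAccessFile.read().

```java
final class Utf8ByteKind {
    // Returns 1-4 for a lead byte (the length of the sequence it starts),
    // 0 for a continuation byte, and -1 for a byte that cannot start or
    // continue a UTF-8 sequence under the patterns described above.
    static int classify(int b) {
        if ((b & 0x80) == 0x00) return 1;  // 0xxxxxxx: single-byte character
        if ((b & 0xC0) == 0x80) return 0;  // 10xxxxxx: continuation byte
        if ((b & 0xE0) == 0xC0) return 2;  // 110xxxxx: first of 2 bytes
        if ((b & 0xF0) == 0xE0) return 3;  // 1110xxxx: first of 3 bytes
        if ((b & 0xF8) == 0xF0) return 4;  // 11110xxx: first of 4 bytes
        return -1;                         // 11111xxx: not valid in UTF-8
    }
}
```

The masks keep one more bit than the prefix they test for, so that the terminating 0 bit of each prefix is checked as well.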

So if you want to let your caller seek to any random position in a file and read the UTF-8 character there, you can just apply the algorithm above to find the first byte of that character (if it's not the one at the specified position) and then read and decode the value.
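As a rough sketch of that idea, the method below seeks to an arbitrary byte offset, backs up to the lead byte if necessary, and decodes the whole character as a code point. The names Utf8SeekSketch and codePointAt are hypothetical, and the sketch does not reject overlong encodings or surrogate values, which a strict decoder would.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UTFDataFormatException;

final class Utf8SeekSketch {
    // Returns the code point whose encoding covers byte offset pos, or -1 at end of file.
    static int codePointAt(RandomAccessFile file, long pos) throws IOException {
        file.seek(pos);
        int b = file.read();
        if (b == -1) return -1;

        // Step back over continuation bytes (10xxxxxx); a character has at most 3.
        for (int back = 0; (b & 0xC0) == 0x80; back++) {
            if (back == 3 || pos == 0)
                throw new UTFDataFormatException("no lead byte near offset " + pos);
            file.seek(--pos);
            b = file.read();
        }

        int length, cp;
        if ((b & 0x80) == 0)         return b;                      // 1 byte
        else if ((b & 0xE0) == 0xC0) { length = 2; cp = b & 0x1F; }
        else if ((b & 0xF0) == 0xE0) { length = 3; cp = b & 0x0F; }
        else if ((b & 0xF8) == 0xF0) { length = 4; cp = b & 0x07; }
        else throw new UTFDataFormatException("invalid lead byte at offset " + pos);

        // Each continuation byte contributes its 6 low bits.
        for (int i = 1; i < length; i++) {
            int next = file.read();                                 // -1 at EOF fails the check below
            if ((next & 0xC0) != 0x80)
                throw new UTFDataFormatException("truncated character at offset " + pos);
            cp = (cp << 6) | (next & 0x3F);
        }
        return cp;
    }
}
```

Returning an int code point rather than a char avoids losing the second half of characters outside the Basic Multilingual Plane; Character.toChars(cp) converts it to one or two chars when needed.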

See the Java Charset class for a method to decode UTF-8 from the source bytes. There may be easier ways but Charset will work.
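For reference, decoding with Charset could look like the sketch below, assuming the bytes of one or more complete characters have already been collected into an array. The wrapper class CharsetDecodeSketch is only illustrative; Charset.decode(ByteBuffer) itself is part of the JDK and substitutes a replacement character for malformed input instead of throwing.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class CharsetDecodeSketch {
    // Decodes a byte array that holds complete UTF-8 sequences.
    static String decode(byte[] utf8) {
        return StandardCharsets.UTF_8.decode(ByteBuffer.wrap(utf8)).toString();
    }

    public static void main(String[] args) {
        byte[] cedilla = { (byte) 0xC3, (byte) 0xA7 };      // UTF-8 bytes of U+00E7
        System.out.println(decode(cedilla).charAt(0) == '\u00E7'); // prints "true"
    }
}
```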

Update: This code should handle the 1- and 2-byte UTF-8 cases. Not tested at all, YMMV.

for (;;) {
    int b = source.read();
    // End of file: read() returns -1.
    if (b == -1)
        return '\0';
    // Single byte character starting with binary 0.
    if ((b & 0x80) == 0)
        return (char) b;
    // 2-byte character starting with binary 110.
    if ((b & 0xE0) == 0xC0)
        return (char) ((b & 0x1F) << 6 | source.read() & 0x3F);
    // 2nd, 3rd, or 4th byte of a multibyte char starting with 10.
    // Back up past this byte and the one before it, then loop.
    if ((b & 0xC0) == 0x80) {
        source.seek(source.getFilePointer() - 2);
        continue;
    }
    // 3 and 4 byte encodings left as an exercise...
    throw new UTFDataFormatException("unsupported lead byte: " + b);
}

I wouldn't bother with seekPointer. The RandomAccessFile already tracks its own position; just call getFilePointer() when you need it.

Building on Willis Blackburn's answer, I can simply compare the integer value of the first byte against a few thresholds to find out how many bytes I need to read ahead.

Judging by the following table:

first byte starts with 0         <= 127              1 byte char
byte starts with 10              >= 128 && <= 191    continuation byte (middle of a char)
first byte starts with 110       >= 192 && <= 223    2 bytes char
first byte starts with 1110      >= 224 && <= 239    3 bytes char
first byte starts with 11110     >= 240 && <= 247    4 bytes char

We can check the integer read from RandomAccessFile.read() by comparing it against the numbers in the middle column, which are literally just the integer representations of a byte. This allows us to skip byte conversion completely, saving time.

The following code reads a character with a byte length of 1-4 from a RandomAccessFile:

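import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UTFDataFormatException;
import java.nio.charset.Charset;

int seekPointer = 0;
RandomAccessFile source; // initialise in your own way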

public void seek(int shift) {
    seekPointer += shift;
    if (seekPointer < 0) seekPointer = 0;
    try {
        source.seek(seekPointer);
    } catch (IOException e) {
        e.printStackTrace();
    }
}

private int byteCheck(int chr) {
    if (chr == -1) return 1; // eof
    int i = 1; // there's always at least one byte
    if (chr >= 192) i++; // 2 bytes
    if (chr >= 224) i++; // 3 bytes
    if (chr >= 240) i++; // 4 bytes
    if (chr >= 128 && chr <= 191) i = -1; // continuation byte: we're partway through a char
    return i;
}

public char nextChar() {
    try {
        int i = source.read();
        if (i == -1) return '\0'; // read() returns -1 at EOF
        seekPointer++;

        if (byteCheck(i) == -1) {
            // We landed on a continuation byte, so step back to the lead byte.
            // A UTF-8 char has at most 3 continuation bytes, so 3 steps suffice;
            // going any further back would run into the previous char.
            boolean malformed = true;
            for (int k = 0; k < 3; k++) {
                seek(-2); // undo the last read, then move one byte further back
                i = source.read();
                seekPointer++; // keep seekPointer in sync with the file pointer
                if (byteCheck(i) != -1) {
                    malformed = false;
                    break;
                }
            }
            if (malformed) {
                throw new UTFDataFormatException("Malformed UTF char at position: " + seekPointer);
            }
        }

        byte[] chrs = new byte[byteCheck(i)];
        chrs[0] = (byte) i;

        for (int j = 1; j < chrs.length; j++) {
            chrs[j] = (byte) source.read();
            seekPointer++;
        }

        // A 4-byte char decodes to a surrogate pair; charAt(0) returns only its first half.
        return new String(chrs, Charset.forName("UTF-8")).charAt(0);
    } catch (IOException e) {
        e.printStackTrace();
    }
    return '\0';
}

From the case statement in java.io.DataInputStream.readUTF(DataInput) you can derive something like

public static char readUtf8Char(final DataInput dataInput) throws IOException {
    int char1, char2, char3;

    char1 = dataInput.readByte() & 0xff;
    switch (char1 >> 4) {
        case 0: case 1: case 2: case 3: case 4: case 5: case 6: case 7:
            /* 0xxxxxxx*/
            return (char)char1;
        case 12: case 13:
            /* 110x xxxx   10xx xxxx*/
            char2 = dataInput.readByte() & 0xff;
            if ((char2 & 0xC0) != 0x80) {
                throw new UTFDataFormatException("malformed input");
            }
            return (char)(((char1 & 0x1F) << 6) | (char2 & 0x3F));
        case 14:
            /* 1110 xxxx  10xx xxxx  10xx xxxx */
            char2 = dataInput.readByte() & 0xff;
            char3 = dataInput.readByte() & 0xff;
            if (((char2 & 0xC0) != 0x80) || ((char3 & 0xC0) != 0x80)) {
                throw new UTFDataFormatException("malformed input");
            }
            return (char)(((char1 & 0x0F) << 12) | ((char2 & 0x3F) << 6) | ((char3 & 0x3F) << 0));
        default:
            /* 10xx xxxx,  1111 xxxx */
            throw new UTFDataFormatException("malformed input");
    }
}

Note that RandomAccessFile implements DataInput, so you can pass it to the above method. If the data was written with writeUTF(), you first need to read the unsigned short that precedes it, which gives the encoded length of the string in bytes; a plain UTF-8 text file has no such prefix.

Note that the encoding used here is modified-UTF-8 as described in the Javadoc of DataInput.
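To see where modified UTF-8 departs from standard UTF-8, the sketch below writes strings with DataOutputStream.writeUTF and compares the byte counts with String.getBytes(StandardCharsets.UTF_8). The class and helper names (ModifiedUtf8Sketch, writeUtf) are made up for the example.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

final class ModifiedUtf8Sketch {
    // Returns exactly what writeUTF emits: a 2-byte length prefix plus modified UTF-8.
    static byte[] writeUtf(String s) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(s);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        String emoji = new String(Character.toChars(0x1F600)); // outside the BMP

        // Standard UTF-8 uses one 4-byte sequence for this character.
        System.out.println(emoji.getBytes(StandardCharsets.UTF_8).length); // prints "4"

        // Modified UTF-8 encodes each surrogate as 3 bytes: 2 (prefix) + 6 = 8.
        System.out.println(writeUtf(emoji).length);                        // prints "8"

        // U+0000 is written as the two bytes C0 80 rather than a single zero byte.
        System.out.println(writeUtf("\0").length);                         // prints "4"
    }
}
```

This is why the default branch of the switch above rejects lead bytes of the form 1111xxxx: 4-byte sequences never occur in data written by writeUTF, but they do occur in ordinary UTF-8 files.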
