Reading a single UTF-8 character with RandomAccessFile

I've set up a sequential scanner in which a RandomAccessFile pointing to my file reads a single character via the method below:

public char nextChar() {
    try {
        seekPointer++;
        int i = source.read();
        return i > -1 ? (char) i : '\0'; // INFO: EOF character is -1.
    } catch (IOException e) {
        e.printStackTrace();
    }
    return '\0';
}

The seekPointer is just a reference for my program. The method stores source.read() in an int and then returns it cast to a char if it's not the end of the file. But the chars I'm receiving are effectively ASCII; in fact, it's so bad that I can't even use a symbol such as ç.

Is there a way I can receive a single character in UTF-8 format, or at least in something standardised that allows more than just the ASCII character set?

I know I can use readUTF(), but that returns an entire line as a String, which is not what I am after.

Also, I can't simply use another stream reader, because my program requires a seek(int) function that lets me move back and forth in the file.

I'm not entirely sure what you're trying to do, but let me give you some information that might help.

The UTF-8 encoding represents each character as 1, 2, 3, or 4 bytes, depending on the Unicode value of the character.

  • For characters 0x00-0x7F, UTF-8 encodes the character as a single byte. This is a very useful property because if you're only dealing with 7-bit ASCII characters, the UTF-8 and ASCII encodings are identical.
  • For characters 0x80-0x7FF, UTF-8 uses 2 bytes: the first byte is binary 110 followed by the 5 high bits of the character, while the second byte is binary 10 followed by the 6 low bits of the character.
  • The 3- and 4-byte encodings are similar to the 2-byte encoding, except that the first byte of the 3-byte encoding starts with 1110 and the first byte of the 4-byte encoding starts with 11110.
  • See Wikipedia for all the details.
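To make the 2-byte layout concrete, here is a small sketch of the bit arithmetic for characters in the range 0x80-0x7FF. The class and method names are made up for illustration; they are not part of any library.

```java
/** Illustrative helper (hypothetical name) for the 2-byte UTF-8 layout. */
final class TwoByteEncoding {
    /** Encodes a code point in the range 0x80-0x7FF as two UTF-8 byte values. */
    static int[] encode(int codePoint) {
        if (codePoint < 0x80 || codePoint > 0x7FF) {
            throw new IllegalArgumentException("not a 2-byte code point: " + codePoint);
        }
        int first = 0xC0 | (codePoint >> 6);    // binary 110 + the 5 high bits
        int second = 0x80 | (codePoint & 0x3F); // binary 10 + the 6 low bits
        return new int[] { first, second };
    }

    /** Reverses encode(): strips the marker bits and joins the payload bits. */
    static int decode(int first, int second) {
        return ((first & 0x1F) << 6) | (second & 0x3F);
    }
}
```

For example, ç is U+00E7, and TwoByteEncoding.encode(0xE7) gives the byte values 0xC3 and 0xA7.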

Now this may seem pretty byzantine, but the upshot is this: you can read any byte in a UTF-8 file and know whether you're looking at a standalone character, the first byte of a multibyte character, or one of the other bytes of a multibyte character.

If the byte you read starts with binary 0, you're looking at a single-byte character. If it starts with 110, 1110, or 11110, then you have the first byte of a multibyte character of 2, 3, or 4 bytes, respectively. If it starts with 10, then it's one of the subsequent bytes of a multibyte character; scan backwards to find the start of it.

So if you want to let your caller seek to any random position in a file and read the UTF-8 character there, you can just apply the algorithm above to find the first byte of that character (if it's not the one at the specified position) and then read and decode the value.

See the Java Charset class for a method to decode UTF-8 from the source bytes. There may be easier ways, but Charset will work.

Update: This code should handle the 1- and 2-byte UTF-8 cases. Not tested at all, YMMV.
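The classification just described can be sketched as a handful of mask tests. The class name and the return-value convention (0 for a continuation byte, -1 for an impossible byte) are choices made for this example only.

```java
/** Illustrative helper (hypothetical name) that classifies one byte of UTF-8 data. */
final class Utf8ByteKind {
    /**
     * Returns 1-4 for a byte that starts a sequence of that many bytes,
     * 0 for a continuation byte (10xxxxxx), and -1 for a byte pattern
     * that never appears in UTF-8 (11111xxx).
     */
    static int sequenceLength(int b) {
        b &= 0xFF;                        // accept signed byte values as well
        if ((b & 0x80) == 0x00) return 1; // 0xxxxxxx: standalone character
        if ((b & 0xC0) == 0x80) return 0; // 10xxxxxx: middle of a character
        if ((b & 0xE0) == 0xC0) return 2; // 110xxxxx
        if ((b & 0xF0) == 0xE0) return 3; // 1110xxxx
        if ((b & 0xF8) == 0xF0) return 4; // 11110xxx
        return -1;                        // 11111xxx
    }
}
```

This only looks at the leading bits; it does not reject lead bytes that are invalid for other reasons, such as those of overlong encodings.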
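As a rough sketch of that idea, the method below seeks to an arbitrary byte offset, backs up over continuation bytes, and decodes the whole sequence into a code point. The class and method names are invented for this example, and the validation is deliberately minimal (no checks for overlong encodings or surrogate code points).

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UTFDataFormatException;

/** Illustrative helper (hypothetical name), not a hardened decoder. */
final class Utf8Seek {
    /** Returns the code point whose encoding covers byte offset pos, or -1 at end of file. */
    static int codePointAt(RandomAccessFile file, long pos) throws IOException {
        long start = pos;
        file.seek(start);
        int lead = file.read();
        if (lead == -1) return -1;

        // Back up over at most three continuation bytes (10xxxxxx).
        int steps = 0;
        while ((lead & 0xC0) == 0x80) {
            if (++steps > 3 || start == 0) {
                throw new UTFDataFormatException("no lead byte before offset " + pos);
            }
            file.seek(--start);
            lead = file.read();
        }

        // Work out the sequence length and the payload bits of the lead byte.
        int length, cp;
        if ((lead & 0x80) == 0x00)      { length = 1; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { length = 2; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { length = 3; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { length = 4; cp = lead & 0x07; }
        else throw new UTFDataFormatException("invalid lead byte at offset " + start);

        // Each continuation byte contributes 6 more bits.
        for (int i = 1; i < length; i++) {
            int b = file.read();
            if ((b & 0xC0) != 0x80) { // also true when b is -1 (end of file)
                throw new UTFDataFormatException("truncated sequence at offset " + start);
            }
            cp = (cp << 6) | (b & 0x3F);
        }
        return cp;
    }
}
```

It returns an int rather than a char because a 4-byte sequence encodes a code point above U+FFFF. After the call, the file pointer sits just past the decoded character.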
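One way to use Charset for this, once the bytes of a character have been collected, might look like the sketch below. The class and method names are made up for the example; the calls themselves (StandardCharsets.UTF_8, newDecoder, onMalformedInput, decode) are standard java.nio.charset API.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

/** Illustrative helper (hypothetical name) showing two ways to decode UTF-8 bytes. */
final class Utf8Decode {
    /** Strict: throws CharacterCodingException on malformed input. */
    static String strict(byte[] bytes) throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    /** Lenient: malformed input is silently replaced with U+FFFD. */
    static String lenient(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```

The strict variant is probably the better fit when you have seeked into the middle of a file, because a wrong starting offset then shows up as an exception instead of a replacement character.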

for (;;) {
    int b = source.read();
    // End of file.
    if (b == -1)
        return '\0';
    // Single byte character starting with binary 0.
    if ((b & 0x80) == 0)
        return (char) b;
    // 2-byte character starting with binary 110.
    if ((b & 0xE0) == 0xC0)
        return (char) ((b & 0x1F) << 6 | source.read() & 0x3F);
    // 2nd, 3rd, or 4th byte of a multibyte char starting with 10.
    // Back up to the byte before the one just read and loop.
    if ((b & 0xC0) == 0x80) {
        source.seek(source.getFilePointer() - 2);
        continue;
    }
    // 3 and 4 byte encodings left as an exercise...
    throw new UTFDataFormatException("3- and 4-byte sequences are not handled");
}

I wouldn't bother with seekPointer. The RandomAccessFile knows what its position is; just call getFilePointer() when you need it.

Building on Willis Blackburn's answer, I can simply compare the integer that was read against a few thresholds to work out how many bytes I need to read ahead.

Judging by the following table:

byte starts with 0        <= 127              1 byte char
byte starts with 10       >= 128 && <= 191    continuation byte (not a first byte)
byte starts with 110      >= 192              2 bytes char
byte starts with 1110     >= 224              3 bytes char
byte starts with 11110    >= 240              4 bytes char

We can check the integer read from RandomAccessFile.read() by comparing it against the numbers in the middle column, which are simply the integer values of the byte. This allows us to skip byte conversion completely, saving time.

The following code reads a character with a byte length of 1-4 from a RandomAccessFile:

import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UTFDataFormatException;
import java.nio.charset.StandardCharsets;

int seekPointer = 0;
RandomAccessFile source; // initialise in your own way

// Moves the position by a relative offset, clamping at the start of the file.
public void seek(int shift) {
    seekPointer += shift;
    if (seekPointer < 0) seekPointer = 0;
    try {
        source.seek(seekPointer);
    } catch (IOException e) {
        e.printStackTrace();
    }
}

// Returns the length of the sequence this byte starts, or -1 for a continuation byte.
private int byteCheck(int chr) {
    if (chr == -1) return 1; // eof
    int i = 1; // there's always at least one byte
    if (chr >= 192) i++; // 2 bytes
    if (chr >= 224) i++; // 3 bytes
    if (chr >= 240) i++; // 4 bytes
    if (chr >= 128 && chr <= 191) i = -1; // whoops, we're halfway through a char!
    return i;
}

public char nextChar() {
    try {
        int i = source.read();
        if (i == -1) return '\0'; // EOF: nothing was consumed.
        seekPointer++; // keep seekPointer in step with the file position

        if (byteCheck(i) == -1) {
            boolean malformed = true;
            // Step back at most 3 bytes, because a UTF-8 char is at most 4 bytes long.
            // Any further and we would run into the previous char.
            for (int k = 0; k < 3; k++) {
                seek(-2); // undo the last read, then go one byte further back
                i = source.read();
                seekPointer++;
                if (byteCheck(i) != -1) {
                    malformed = false;
                    break;
                }
            }
            if (malformed) {
                seek(3); // return to the position just after the first byte read
                throw new UTFDataFormatException("Malformed UTF char at position: " + seekPointer);
            }
        }

        byte[] chrs = new byte[byteCheck(i)];
        chrs[0] = (byte) i;

        for (int j = 1; j < chrs.length; j++) {
            int b = source.read();
            if (b == -1) break; // file ends in the middle of a char
            seekPointer++;
            chrs[j] = (byte) b;
        }

        // A 4-byte sequence decodes to a surrogate pair; charAt(0) is only its first half.
        return new String(chrs, StandardCharsets.UTF_8).charAt(0);
    } catch (IOException e) {
        e.printStackTrace();
    }
    return '\0';
}
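One caveat about the 4-byte case: a Java char is a 16-bit UTF-16 code unit, so a 4-byte UTF-8 sequence (a code point above U+FFFF) decodes to two chars, and charAt(0) returns only the high surrogate. The sketch below shows the difference; the class and method names are invented for the example.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative helper (hypothetical name) showing what a 4-byte sequence decodes to. */
final class SurrogateDemo {
    /** Number of Java chars (UTF-16 code units) the UTF-8 bytes decode to. */
    static int charCount(byte[] utf8) {
        return new String(utf8, StandardCharsets.UTF_8).length();
    }

    /** The first full code point of the decoded bytes, as an int. */
    static int firstCodePoint(byte[] utf8) {
        return new String(utf8, StandardCharsets.UTF_8).codePointAt(0);
    }

    /** The first char only, which is what a char-returning reader would hand back. */
    static char firstChar(byte[] utf8) {
        return new String(utf8, StandardCharsets.UTF_8).charAt(0);
    }
}
```

For the bytes F0 9F 98 80 (U+1F600), charCount gives 2 and firstChar gives a high surrogate, so a reader that needs such characters would have to return an int code point or a String instead of a char.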

From the case statement in java.io.DataInputStream.readUTF(DataInput) you can derive something like:

public static char readUtf8Char(final DataInput dataInput) throws IOException {
    int char1, char2, char3;

    char1 = dataInput.readByte() & 0xff;
    switch (char1 >> 4) {
        case 0: case 1: case 2: case 3: case 4: case 5: case 6: case 7:
            /* 0xxxxxxx*/
            return (char)char1;
        case 12: case 13:
            /* 110x xxxx   10xx xxxx*/
            char2 = dataInput.readByte() & 0xff;
            if ((char2 & 0xC0) != 0x80) {
                throw new UTFDataFormatException("malformed input");
            }
            return (char)(((char1 & 0x1F) << 6) | (char2 & 0x3F));
        case 14:
            /* 1110 xxxx  10xx xxxx  10xx xxxx */
            char2 = dataInput.readByte() & 0xff;
            char3 = dataInput.readByte() & 0xff;
            if (((char2 & 0xC0) != 0x80) || ((char3 & 0xC0) != 0x80)) {
                throw new UTFDataFormatException("malformed input");
            }
            return (char)(((char1 & 0x0F) << 12) | ((char2 & 0x3F) << 6) | ((char3 & 0x3F) << 0));
        default:
            /* 10xx xxxx,  1111 xxxx */
            throw new UTFDataFormatException("malformed input");
    }
}

Note that RandomAccessFile implements DataInput, so you can pass it to the method above. If the data was written with writeUTF(), then before calling the method for the first character you need to read an unsigned short, which holds the length of the UTF string in bytes.

Note that the encoding used here is modified UTF-8, as described in the Javadoc of DataInput.
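To see where modified UTF-8 and the length prefix differ from plain UTF-8, the sketch below compares the bytes written by DataOutputStream.writeUTF with those from String.getBytes. The class and method names are made up for the example.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

/** Illustrative helper (hypothetical name) comparing writeUTF output with standard UTF-8. */
final class ModifiedUtf8Demo {
    /** Bytes produced by writeUTF: a 2-byte length prefix followed by modified UTF-8. */
    static byte[] writeUtf(String s) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DataOutputStream data = new DataOutputStream(out)) {
            data.writeUTF(s);
        }
        return out.toByteArray();
    }

    /** Bytes of the same string in standard UTF-8, with no prefix. */
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }
}
```

For "ç" the two agree apart from the prefix (00 02 C3 A7 versus C3 A7). They diverge for U+0000, which writeUTF encodes as C0 80, and for characters above U+FFFF, which writeUTF encodes as two 3-byte surrogates instead of one 4-byte sequence. A file saved as ordinary UTF-8 by a text editor has neither the prefix nor these special cases.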
