
writeUTF in java.io.DataOutputStream

I know that a character encoded as UTF-8 takes 1 to 4 bytes in Java. But when I used the readUTF/writeUTF methods in java.io.DataInputStream/DataOutputStream, I found that they only handle the cases where a character needs 1 to 3 bytes.

static int writeUTF(String str, DataOutput out) throws IOException {
    int strlen = str.length();
    int utflen = 0;
    int c, count = 0;

   /* use charAt instead of copying String to char array */
    for (int i = 0; i < strlen; i++) {
        c = str.charAt(i);
        if ((c >= 0x0001) && (c <= 0x007F)) {
            utflen++;
        } else if (c > 0x07FF) {
            utflen += 3;
        } else {
            utflen += 2;
        }
    }

    if (utflen > 65535)
        throw new UTFDataFormatException(
            "encoded string too long: " + utflen + " bytes");

    byte[] bytearr = null;
    if (out instanceof DataOutputStream) {
        DataOutputStream dos = (DataOutputStream)out;
        if(dos.bytearr == null || (dos.bytearr.length < (utflen+2)))
            dos.bytearr = new byte[(utflen*2) + 2];
        bytearr = dos.bytearr;
    } else {
        bytearr = new byte[utflen+2];
    }

    bytearr[count++] = (byte) ((utflen >>> 8) & 0xFF);
    bytearr[count++] = (byte) ((utflen >>> 0) & 0xFF);

    int i=0;
    for (i=0; i<strlen; i++) {
       c = str.charAt(i);
       if (!((c >= 0x0001) && (c <= 0x007F))) break;
       bytearr[count++] = (byte) c;
    }

    for (;i < strlen; i++){
        c = str.charAt(i);
        if ((c >= 0x0001) && (c <= 0x007F)) {
            bytearr[count++] = (byte) c;

        } else if (c > 0x07FF) {
            bytearr[count++] = (byte) (0xE0 | ((c >> 12) & 0x0F));
            bytearr[count++] = (byte) (0x80 | ((c >>  6) & 0x3F));
            bytearr[count++] = (byte) (0x80 | ((c >>  0) & 0x3F));
        } else {
            bytearr[count++] = (byte) (0xC0 | ((c >>  6) & 0x1F));
            bytearr[count++] = (byte) (0x80 | ((c >>  0) & 0x3F));
        }
    }
    out.write(bytearr, 0, utflen+2);
    return utflen + 2;
}

Why doesn't it handle the case where a character needs 4 bytes?

It's all explained in the docs, though you have to go through an extra click.

The docs for DataOutputStream#writeUTF mention that it uses a "modified UTF-8 encoding." That link is in the original JavaDocs (I didn't just add it for this answer), and if you follow it, you get a page explaining that encoding. Note specifically the part near the bottom of the summary (before you get into the method summary section):

The differences between this format and the standard UTF-8 format are the following:

...

• Only the 1-byte, 2-byte, and 3-byte formats are used.

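So, while you're right in thinking that UTF-8 uses up to 4 bytes, writeUTF uses a modified version, and one of the modifications is that it only supports up to 3 bytes. A character outside the Basic Multilingual Plane is stored in a Java String as a surrogate pair of two char values, and the loop in your snippet encodes each char separately, so such a character is written as two 3-byte sequences (6 bytes) rather than one 4-byte sequence.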
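To see the difference for yourself, here is a minimal sketch that encodes the same string both ways and compares the byte counts. The class and method names (ModifiedUtf8Demo, modifiedUtf8, standardUtf8) are made up for this illustration; only writeUTF and String#getBytes are real JDK calls. The test character U+1F600 is stored in a Java String as the surrogate pair \uD83D\uDE00.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ModifiedUtf8Demo {

    // Hypothetical helper: the bytes writeUTF produces for s,
    // with the leading 2-byte length prefix stripped off.
    static byte[] modifiedUtf8(String s) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        }
        byte[] all = buf.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    // Hypothetical helper: the standard UTF-8 encoding of s.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        String emoji = "\uD83D\uDE00"; // U+1F600, one code point, two chars

        // Standard UTF-8 uses a single 4-byte sequence.
        System.out.println(standardUtf8(emoji).length); // prints 4

        // writeUTF encodes each surrogate as its own 3-byte sequence.
        System.out.println(modifiedUtf8(emoji).length); // prints 6
    }
}
```

readUTF reverses the same encoding, so the round trip through DataInputStream still gives back the original string. The two formats only disagree on the bytes in between, which matters if something other than readUTF has to parse them.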

