How to convert the "Java modified UTF-8" to the regular UTF-8 and back?

I have created a Java wrapper around a native C library and have a question about string encodings. The “Java modified UTF-8” encoding used by Java differs slightly from regular UTF-8, and these differences can cause serious problems: JNI functions may crash the app when passed regular UTF-8, because it may contain byte sequences that are forbidden in “Java modified UTF-8”. Please see the following topic: What does it mean to say "Java Modified UTF-8 Encoding"?

My question: what is a standard, reliable way to convert “Java modified UTF-8” to regular UTF-8 and back?

First, consider whether you really need or want to do that. The only reason I can think of for doing so in the context of wrapping a C library is to use the JNI functions that work with Java Strings in terms of byte arrays encoded in modified UTF-8, but that is neither the only way to proceed nor, except in rather specific circumstances, the best one.

For most cases, I would recommend going directly from UTF-8 to String objects and letting Java do most of that work. Simple tools Java provides for that include the constructor String(byte[], String), which initializes a String from data in an encoding you specify, and String.getBytes(String), which gives you the string's character data in the encoding of your choice. Both are limited to encodings known to the JVM, but UTF-8 is guaranteed to be among those. You can call them directly from your JNI code, or provide purpose-built wrapper methods for your JNI code to invoke.
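As a rough illustration of that recommendation, here is a minimal sketch of the Java-side wrapper methods. The class and method names are made up for this example, and it uses the Charset overloads with StandardCharsets.UTF_8 instead of the charset-name overloads mentioned above, which avoids the checked UnsupportedEncodingException.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper; the names are illustrative, not part of any library.
final class Utf8Strings {
    // Decode standard UTF-8 bytes (for example, received from native code) into a String.
    // Malformed input is replaced with U+FFFD rather than reported as an error.
    static String fromUtf8(byte[] utf8) {
        return new String(utf8, StandardCharsets.UTF_8);
    }

    // Encode a String as standard UTF-8 bytes to hand to native code.
    static byte[] toUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }
}
```

With wrappers like these, the native side only ever sees byte arrays in standard UTF-8, and the JNI code can call them instead of touching modified UTF-8 at all.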

If you really do want the modified UTF-8 form for its own sake, then your JNI code can obtain it from the corresponding Java string (obtained as summarized above) via the GetStringUTFChars JNI function, and you can go the other way with NewStringUTF. Of course, this makes Java Strings the intermediate form, which is entirely apropos in this case.
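For experimenting on the Java side without JNI, the same idea (String as the intermediate form) can be sketched with DataOutputStream.writeUTF and DataInputStream.readUTF, which use modified UTF-8 preceded by a 2-byte length. This is only a sketch under that assumption: the class and method names are invented here, and the 2-byte length limits it to encoded strings of at most 65535 bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper; names are illustrative only.
final class ModifiedUtf8Demo {
    // Encode a String as modified UTF-8, dropping the 2-byte length prefix that writeUTF emits.
    static byte[] toModifiedUtf8(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DataOutputStream data = new DataOutputStream(out)) {
            data.writeUTF(s); // fails if the encoded form exceeds 65535 bytes
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] framed = out.toByteArray();
        return Arrays.copyOfRange(framed, 2, framed.length);
    }

    // Decode modified UTF-8 by restoring the length prefix that readUTF expects.
    static String fromModifiedUtf8(byte[] mutf8) {
        if (mutf8.length > 0xFFFF)
            throw new IllegalArgumentException("too long for readUTF: " + mutf8.length);
        ByteArrayOutputStream framed = new ByteArrayOutputStream();
        framed.write(mutf8.length >> 8); // high byte of the length
        framed.write(mutf8.length);      // low byte of the length
        framed.write(mutf8, 0, mutf8.length);
        try {
            return new DataInputStream(new ByteArrayInputStream(framed.toByteArray())).readUTF();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Standard UTF-8 to modified UTF-8 and back, with a String in the middle.
    static byte[] standardToModified(byte[] utf8) {
        return toModifiedUtf8(new String(utf8, StandardCharsets.UTF_8));
    }

    static byte[] modifiedToStandard(byte[] mutf8) {
        return fromModifiedUtf8(mutf8).getBytes(StandardCharsets.UTF_8);
    }
}
```

The two encodings differ in exactly two cases: U+0000 becomes the two bytes 0xC0 0x80 in modified UTF-8, and a character outside the Basic Multilingual Plane becomes two three-byte sequences (one per surrogate) instead of one four-byte sequence.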

Thanks everyone for your replies! I finally found the answer. The only documented way to perform such conversions is to use InputStreamReader and OutputStreamWriter:

In normal usage, the Java programming language supports standard UTF-8 when reading and writing strings through InputStreamReader and OutputStreamWriter (if it is the platform's default character set or as requested by the program).

https://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8

Also, the NewStringUTF JNI function expects modified UTF-8 input, not standard UTF-8. It will crash the app if it receives a forbidden byte sequence, and JNI exception handling cannot prevent the crash.

So my second conclusion is that passing String/jstring from JNI to Java or the other way is always a bad idea. Never do that. Perform all conversions with InputStreamReader and OutputStreamWriter on the Java side, and pass raw byte arrays to and from JNI.
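A minimal sketch of that approach, assuming the native side produces and consumes standard UTF-8 byte arrays. The class and method names are illustrative only.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; names are illustrative only.
final class StreamUtf8 {
    // String -> standard UTF-8 bytes, ready to pass to native code as a byte[].
    static byte[] encode(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            writer.write(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray(); // the writer is closed (and flushed) by now
    }

    // Standard UTF-8 bytes received from native code -> String.
    static String decode(byte[] utf8) {
        StringBuilder result = new StringBuilder();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(utf8), StandardCharsets.UTF_8)) {
            char[] chunk = new char[1024];
            int n;
            while ((n = reader.read(chunk)) != -1)
                result.append(chunk, 0, n);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result.toString();
    }
}
```

The native methods would then be declared with byte[] parameters and return types, so neither NewStringUTF nor GetStringUTFChars is needed on the C side.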

No conversion requires a library call; you can always write it yourself. The methods below decode modified UTF-8 bytes into a String and encode a String back into modified UTF-8.

Note: the Buffer class below just wraps a byte array, the same way a String wraps a char array.
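The Buffer class itself is not shown in this answer. To try the code, a stand-in along the following lines should be enough; this is a guess that covers only the three members the code uses (create, getBytes, and equals), not the author's actual class.

```java
import java.util.Arrays;

// Hypothetical stand-in for the Buffer class referenced in this answer.
final class Buffer {
    private final byte[] bytes;

    private Buffer(byte[] bytes) {
        this.bytes = bytes;
    }

    // Wraps a copy of length bytes of source, starting at offset.
    static Buffer create(byte[] source, int offset, int length) {
        return new Buffer(Arrays.copyOfRange(source, offset, offset + length));
    }

    // Returns a copy so callers cannot modify the wrapped array.
    byte[] getBytes() {
        return bytes.clone();
    }

    // Two buffers are equal when they hold the same bytes, as with String and its chars.
    @Override
    public boolean equals(Object other) {
        return other instanceof Buffer && Arrays.equals(bytes, ((Buffer) other).bytes);
    }

    @Override
    public int hashCode() {
        return Arrays.hashCode(bytes);
    }
}
```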

public static String stringFromBuffer( Buffer buffer )
{
    String result = stringFromBuffer0( buffer );
    assert bufferFromString0( result ).equals( buffer );
    return result;
}

private static String stringFromBuffer0( Buffer buffer )
{
    byte[] bytes = buffer.getBytes();
    int end = bytes.length;
    char[] chars = new char[end]; /* every char takes at least one byte, so this is always large enough */
    int t = 0;
    for( int s = 0; s < end; )
    {
        int b1 = bytes[s++] & 0xff;
        /* Bad input is rejected with explicit throws rather than asserts, so that validation still
           runs when assertions are disabled. This assumes IncompleteUtf8Exception is unchecked,
           like MalformedUtf8Exception. */
        if( b1 >> 4 <= 7 ) /* 0xxx_xxxx */
            chars[t++] = (char)b1;
        else if( b1 >> 4 <= 11 ) /* 10xx_xxxx: a continuation byte with no lead byte */
            throw new MalformedUtf8Exception( s - 1 );
        else if( b1 >> 4 <= 13 ) /* 110x_xxxx 10xx_xxxx */
        {
            if( s >= end )
                throw new IncompleteUtf8Exception( s - 1 );
            int b2 = bytes[s++] & 0xff;
            if( (b2 & 0xc0) != 0x80 )
                throw new MalformedUtf8Exception( s - 1 );
            chars[t++] = (char)(((b1 & 0x1f) << 6) | (b2 & 0x3f));
        }
        else if( b1 >> 4 == 14 ) /* 1110_xxxx 10xx_xxxx 10xx_xxxx */
        {
            if( s >= end )
                throw new IncompleteUtf8Exception( s - 1 );
            int b2 = bytes[s++] & 0xff;
            if( (b2 & 0xc0) != 0x80 )
                throw new MalformedUtf8Exception( s - 1 );
            if( s >= end )
                throw new IncompleteUtf8Exception( s - 1 );
            int b3 = bytes[s++] & 0xff;
            if( (b3 & 0xc0) != 0x80 )
                throw new MalformedUtf8Exception( s - 1 );
            chars[t++] = (char)(((b1 & 0x0f) << 12) | ((b2 & 0x3f) << 6) | (b3 & 0x3f));
        }
        else /* 1111_xxxx: four-byte sequences do not occur in modified UTF-8 */
            throw new MalformedUtf8Exception( s - 1 );
    }
    return new String( chars, 0, t );
}

private static Buffer bufferFromString( String s )
{
    Buffer result = bufferFromString0( s );
    assert stringFromBuffer( result ).equals( s );
    return result;
}

private static Buffer bufferFromString0( String s )
{
    char[] chars = s.toCharArray();
    byte[] bytes = new byte[chars.length * 3];
    int p = 0;
    for( char c : chars )
    {
        if( (c >= 1) && (c <= 0x7f) ) /* 0xxx_xxxx; U+0000 is excluded and falls through to the two-byte form 0xC0 0x80 */
            bytes[p++] = (byte)c;
        else if( c > 0x07ff )
        {
            bytes[p++] = (byte)(0xe0 | ((c >> 12) & 0x0f));
            bytes[p++] = (byte)(0x80 | ((c >> 6) & 0x3f));
            bytes[p++] = (byte)(0x80 | (c & 0x3f));
        }
        else
        {
            bytes[p++] = (byte)(0xc0 | ((c >> 6) & 0x1f));
            bytes[p++] = (byte)(0x80 | (c & 0x3f));
        }
    }
    if( p > 0xffff )
        throw new StringTooLongException( p );
    return Buffer.create( bytes, 0, p );
}
