Java Unicode printing glitch

Question

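I'm currently writing a program that reads Java class files. At the moment, I read the constant pool of the class file and print it to the console. But when it is printed, some of the Unicode seems to mess up my terminal so that it looks like this: [screenshot]

In case it matters, the class file I'm reading was compiled from Kotlin, and the terminal I'm using is the IntelliJ IDEA terminal; the problem does not seem to occur in the regular Ubuntu terminal. What I noticed is a weird Unicode sequence, which I think might be some kind of escape sequence.

Here is the whole output without the weird Unicode sequence: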

{1=UTF8: (42)'deerangle/decompiler/main/DecompilerMainKt', 2=Class index: 1, 3=UTF8: (16)'java/lang/Object', 4=Class index: 3, 5=UTF8: (4)'main', 6=UTF8: (22)'([Ljava/lang/String;)V', 7=UTF8: (35)'Lorg/jetbrains/annotations/NotNull;', 8=UTF8: (4)'args', 9=String index: 8, 10=UTF8: (30)'kotlin/jvm/internal/Intrinsics', 11=Class index: 10, 12=UTF8: (23)'checkParameterIsNotNull', 13=UTF8: (39)'(Ljava/lang/Object;Ljava/lang/String;)V', 14=Method name index: 12; Type descriptor index: 13, 15=Bootstrap method attribute index: 11; NameType index: 14, 16=UTF8: (12)'java/io/File', 17=Class index: 16, 18=UTF8: (6)'<init>', 19=UTF8: (21)'(Ljava/lang/String;)V', 20=Method name index: 18; Type descriptor index: 19, 21=Bootstrap method attribute index: 17; NameType index: 20, 22=UTF8: (15)'getAbsolutePath', 23=UTF8: (20)'()Ljava/lang/String;', 24=Method name index: 22; Type descriptor index: 23, 25=Bootstrap method attribute index: 17; NameType index: 24, 26=UTF8: (16)'java/lang/System', 27=Class index: 26, 28=UTF8: (3)'out', 29=UTF8: (21)'Ljava/io/PrintStream;', 30=Method name index: 28; Type descriptor index: 29, 31=Bootstrap method attribute index: 27; NameType index: 30, 32=UTF8: (19)'java/io/PrintStream', 33=Class index: 32, 34=UTF8: (5)'print', 35=UTF8: (21)'(Ljava/lang/Object;)V', 36=Method name index: 34; Type descriptor index: 35, 37=Bootstrap method attribute index: 33; NameType index: 36, 38=UTF8: (19)'[Ljava/lang/String;', 39=Class index: 38, 40=UTF8: (17)'Lkotlin/Metadata;', 41=UTF8: (2)'mv', 42=Int: 1, 43=Int: 11, 44=UTF8: (2)'bv', 45=Int: 0, 46=Int: 2, 47=UTF8: (1)'k', 48=UTF8: (2)'d1', 49=UTF8: (58)'WEIRD_UNICODE_SEQUENCE', 50=UTF8: (2)'d2', 51=UTF8: (0)'', 52=UTF8: (10)'Decompiler', 53=UTF8: (17)'DecompilerMain.kt', 54=UTF8: (4)'Code', 55=UTF8: (18)'LocalVariableTable', 56=UTF8: (15)'LineNumberTable', 57=UTF8: (13)'StackMapTable', 58=UTF8: (36)'RuntimeInvisibleParameterAnnotations', 59=UTF8: (10)'SourceFile', 60=UTF8: (20)'SourceDebugExtension', 61=UTF8: (25)'RuntimeVisibleAnnotations'}
AccessFlags: {ACC_PUBLIC, ACC_FINAL, ACC_SUPER}

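This is the Unicode sequence opened in Sublime Text: [screenshot]

My questions about all this are: why does this Unicode break the console in IntelliJ IDEA, is this common in Kotlin class files, and what can be done to remove all such "escape sequences" from a String before printing it?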

Answer 1

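For some inexplicable reason, when Sun Microsystems designed Java, they decided to encode the strings in the constant pool with an encoding that is not UTF8. It is a custom encoding used mainly by the Java compiler and class loaders.

To make matters worse, the JVM documentation calls it UTF8. But it is not UTF8, and the name they chose causes a lot of unnecessary confusion. So my guess is that you saw it called UTF8, treated it as real UTF8, and got garbage as a result.

You need to look up the description of CONSTANT_Utf8_info in the JVM specification and write an algorithm that decodes strings according to that specification.

For your convenience, here is some code I wrote: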

public static char[] charsFromBytes( byte[] bytes )
{
    // First pass: count the resulting chars so the array can be sized exactly.
    int t = 0;
    int end = bytes.length;
    for( int s = 0;  s < end;  )
    {
        int b1 = bytes[s] & 0xff;
        if( b1 >> 4 <= 7 ) /* 0xxx_xxxx */
            s++;
        else if( b1 >> 4 >= 12 && b1 >> 4 <= 13 ) /* 110x_xxxx 10xx_xxxx */
            s += 2;
        else if( b1 >> 4 == 14 ) /* 1110_xxxx 10xx_xxxx 10xx_xxxx */
            s += 3;
        else /* 10xx_xxxx or 1111_xxxx cannot start a sequence */
            throw new IllegalArgumentException( "malformed byte at offset " + s );
        t++;
    }
    // Second pass: decode each sequence into one UTF-16 char.
    char[] chars = new char[t];
    t = 0;
    for( int s = 0;  s < end;  )
    {
        int b1 = bytes[s++] & 0xff;
        if( b1 >> 4 <= 7 ) /* 0xxx_xxxx */
            chars[t++] = (char)b1;
        else if( b1 >> 4 >= 12 && b1 >> 4 <= 13 ) /* 110x_xxxx 10xx_xxxx */
        {
            int b2 = continuationByte( bytes, s++ );
            chars[t++] = (char)(((b1 & 0x1f) << 6) | (b2 & 0x3f));
        }
        else if( b1 >> 4 == 14 ) /* 1110_xxxx 10xx_xxxx 10xx_xxxx */
        {
            int b2 = continuationByte( bytes, s++ );
            int b3 = continuationByte( bytes, s++ );
            chars[t++] = (char)(((b1 & 0x0f) << 12) | ((b2 & 0x3f) << 6) | (b3 & 0x3f));
        }
        else /* already rejected by the first pass */
            throw new AssertionError( "unreachable" );
    }
    return chars;
}

// Returns the byte at offset s, checking that it exists and has the form 10xx_xxxx.
private static int continuationByte( byte[] bytes, int s )
{
    if( s >= bytes.length )
        throw new IllegalArgumentException( "incomplete sequence at offset " + s );
    int b = bytes[s] & 0xff;
    if( (b & 0xc0) != 0x80 )
        throw new IllegalArgumentException( "malformed byte at offset " + s );
    return b;
}
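
If the class-file reader is itself written in Java, a possible shortcut is the JDK's `DataInputStream.readUTF()`, which is documented to read an unsigned 16-bit length followed by that many bytes of modified UTF-8. That is the same layout as the part of a CONSTANT_Utf8_info entry that follows the tag byte. The sketch below assumes the tag byte has already been consumed; the class and method names are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class Utf8EntryReader {
    /**
     * Decodes the part of a CONSTANT_Utf8_info entry that follows the tag byte:
     * a u2 length, then that many bytes of modified UTF-8.
     * (Hypothetical helper; the name is not from the JVM specification.)
     */
    static String readUtf8Entry(byte[] lengthAndBytes) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(lengthAndBytes))) {
            return in.readUTF();
        } catch (IOException e) {
            // readUTF reports malformed input as UTFDataFormatException, an IOException
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // length = 3, then 'a', then the two-byte form of U+0000 (0xC0 0x80)
        byte[] entry = { 0x00, 0x03, 'a', (byte) 0xC0, (byte) 0x80 };
        String s = readUtf8Entry(entry);
        System.out.println(s.length()); // the decoded string is 'a' followed by U+0000
    }
}
```

In a real reader the `DataInputStream` would normally wrap the class file itself, so `readUTF()` can be called right after reading the tag.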

Answer 2

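Mike's answer already covers the fact that Java class files don't quite use UTF8 encoding, but I thought I'd provide some more information about it.

The encoding used in Java class files is called Modified UTF-8 (or MUTF-8). It differs from regular UTF-8 in two ways:

The null character (U+0000) is encoded using a two-byte sequence.
Code points outside the BMP are represented as surrogate pairs, as in UTF-16. Each code unit of the pair is in turn encoded in three bytes using the regular UTF-8 scheme.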
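
Both differences can be observed from Java itself, because `DataOutputStream.writeUTF` is documented to emit modified UTF-8 after a two-byte length prefix. The following is a small sketch; the class and method names are illustrative only.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class Mutf8Differences {
    /** Returns the modified UTF-8 bytes of s as hex, skipping the 2-byte length prefix. */
    static String mutf8Hex(String s) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] bytes = buffer.toByteArray();
        StringBuilder hex = new StringBuilder();
        for (int i = 2; i < bytes.length; i++) { // start at 2 to skip the length prefix
            if (hex.length() > 0) {
                hex.append(' ');
            }
            hex.append(String.format("%02X", bytes[i] & 0xff));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        // U+0000: regular UTF-8 would use the single byte 00
        System.out.println(mutf8Hex("\0")); // prints "C0 80"
        // U+1F600: regular UTF-8 would use the four bytes F0 9F 98 80
        String astral = new String(Character.toChars(0x1F600));
        System.out.println(mutf8Hex(astral)); // prints "ED A0 BD ED B8 80"
    }
}
```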

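The first change ensures that the encoded data never contains a raw null byte, which makes it easier to handle in C code. The second change exists because back in the 90s UTF-16 was all the rage, and it was not clear that UTF-8 would eventually win out. In fact, Java uses 16-bit characters for similar reasons. Encoding astral characters as surrogate pairs makes them easier to handle in a 16-bit world. Note that JavaScript, which was designed around the same time, has similar issues with UTF-16 strings.

Anyway, encoding and decoding MUTF-8 is pretty simple. It is just annoying, because hardly any platform builds it in (Java's own DataInput.readUTF and DataOutput.writeUTF are an exception). When decoding, you decode the same way as UTF-8, except that you have to be more tolerant and accept sequences that are technically not valid UTF-8 (despite using the same encoding scheme), and then combine the surrogate pairs where applicable. When encoding, you do the reverse.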
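
The encoding direction can be sketched as follows. Because a Java `String` already stores astral characters as surrogate pairs, encoding each `char` separately yields the two three-byte sequences with no extra work. This is an illustrative sketch with made-up names; it produces only the byte payload, without the two-byte length that a constant pool entry also carries.

```java
import java.io.ByteArrayOutputStream;

final class Mutf8Encoder {
    /** Encodes each UTF-16 code unit of s as modified UTF-8 (no length prefix). */
    static byte[] encode(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= 0x0001 && c <= 0x007F) {
                // 0xxx_xxxx: plain ASCII, except U+0000
                out.write(c);
            } else if (c <= 0x07FF) {
                // 110x_xxxx 10xx_xxxx: also used for U+0000, giving C0 80
                out.write(0xC0 | (c >> 6));
                out.write(0x80 | (c & 0x3F));
            } else {
                // 1110_xxxx 10xx_xxxx 10xx_xxxx: includes each half of a surrogate pair
                out.write(0xE0 | (c >> 12));
                out.write(0x80 | ((c >> 6) & 0x3F));
                out.write(0x80 | (c & 0x3F));
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] bytes = encode("a\0");
        System.out.println(bytes.length); // one byte for 'a', two for U+0000
    }
}
```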

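Note that this only applies to Java bytecode. Programmers working in Java generally don't have to deal with MUTF-8, because Java uses a mix of UTF-16 and real UTF-8 everywhere else.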

Answer 3

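IntelliJ's console most likely interprets certain characters of the string as control characters (compare "Colorize console output in Intellij products").

Most likely it is an ANSI terminal emulation, which you can easily verify by executing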

System.out.println("Hello "
    + "\33[31mc\33[32mo\33[33ml\33[34mo\33[35mr\33[36me\33[37md"
    + " \33[30mtext");

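If you see the text printed in different colors, the console performs an ANSI-compatible interpretation.

In any case, it is always a good idea to remove control characters when printing strings from an unknown source. The string constants in a class file are not required to have human-readable content.

A simple way to do this is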

System.out.println(string.replaceAll("\\p{IsControl}", "."));

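which replaces every control character with a dot before printing.

If you want some diagnostics about the actual char values, you can use, for example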

System.out.println(Pattern.compile("\\p{IsControl}").matcher(string)
    .replaceAll(mr -> String.format("{%02X}", (int)string.charAt(mr.start()))));

This requires Java 9, but of course the same logic can be implemented for earlier Java versions as well. It just needs more verbose code.

The Pattern instance returned by Pattern.compile("\\p{IsControl}") can be stored and reused.
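
For earlier Java versions, the same replacement might look like the sketch below, which also keeps the compiled pattern in a constant so it is reused across calls. It relies only on `Matcher.appendReplacement` and `Matcher.appendTail`; the class and method names are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ControlCharEscaper {
    // Compiled once and reused for every call
    private static final Pattern CONTROL = Pattern.compile("\\p{IsControl}");

    /** Replaces each control character with its hex value in braces, e.g. ESC becomes {1B}. */
    static String escapeControls(String string) {
        Matcher m = CONTROL.matcher(string);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String hex = String.format("{%02X}", (int) string.charAt(m.start()));
            // quoteReplacement guards against '$' and '\' being treated specially
            m.appendReplacement(sb, Matcher.quoteReplacement(hex));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeControls("\33[31mred")); // prints "{1B}[31mred"
    }
}
```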

Answer 1
5 2018-11-23 19:45:00

Answer 2
4 2018-11-23 22:02:33

Answer 3
3 accepted 2018-11-26 11:32:38
