Unicode supplementary planes in JavaFX

I'm having problems dealing with Unicode characters from supplementary ("astral") planes in JavaFX. Specifically, I can't paste such characters in a TextInputDialog (I get some weird characters instead, such as ð), and can't use them in a WebView (they get rendered as ).

The same characters work perfectly fine if I input them via JOptionPane.showInputDialog and print them to the console. They even show in a JavaFX Alert, although it appends some junk at the end.

Is there a way to fix these problems?

I'm using Oracle JDK version 1.8.0_51 on Linux.
Examples of supplementary plane characters: 😀 𐂃 🂡 🙭 𫞂
If you can't see them, you may need to install additional fonts such as Symbola or Noto.

Here's an example program (using a Label rather than a WebView):

import javax.swing.JOptionPane;

import javafx.application.Application;
import javafx.scene.Scene;
import javafx.scene.control.Alert;
import javafx.scene.control.Alert.AlertType;
import javafx.scene.control.Label;
import javafx.scene.control.TextInputDialog;
import javafx.scene.layout.StackPane;
import javafx.stage.Stage;

public class UniTest extends Application {
    @Override
    public void start(final Stage stage) throws Exception {
        final String s = new String(new int[]{127137, 178050, 3232, 128512, 241}, 0, 5);
        System.out.println("The string: " + s);
        System.out.println("Characters: " + s.length());
        System.out.println("Code points: " + s.codePoints().count());

        JOptionPane.showMessageDialog(null, s, "JOptionPane", JOptionPane.INFORMATION_MESSAGE);

        final Alert al = new Alert(AlertType.INFORMATION);
        al.setTitle("Alert");
        al.setContentText(s);
        al.showAndWait();

        final TextInputDialog dlg = new TextInputDialog();
        dlg.setTitle("TextInputDialog");
        dlg.setContentText("Try to paste the string in here");
        dlg.showAndWait().ifPresent(x -> System.out.println("Your input: " + x));

        final StackPane root = new StackPane();
        root.getChildren().add(new Label(s));
        stage.setScene(new Scene(root, 400, 300));
        stage.setTitle("Stage");
        stage.show();
    }

    public static void main(final String... args) {
        launch(args);
    }
}

And here are the results I get:

[Screenshot]

Note: not all the characters in the example are from supplementary planes, and one of the characters is only rendered correctly in the console.

TL;DR: Evidently JavaFX is buggy.

Here is the text you are using.

🂡𫞂ಠ😀ñ

Decimal code point representation:

127137 178050 3232 128512 241

Hex representation:

0x1F0A1 0x2B782 0xCA0 0x1F600 0xF1
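
The hex values above can be reproduced from the question's test string. The following is a minimal sketch; the class name CodePointDump and the method codePointHex are made up for illustration and are not part of any library.

```java
import java.util.stream.Collectors;

// Hypothetical helper (name invented for this example).
final class CodePointDump {
    // Lists every Unicode code point of s in hex, e.g. "0x1F0A1".
    static String codePointHex(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("0x%X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // Same test string as in the question, built from decimal code points.
        String s = new String(new int[]{127137, 178050, 3232, 128512, 241}, 0, 5);
        System.out.println(codePointHex(s));
        // prints: 0x1F0A1 0x2B782 0xCA0 0x1F600 0xF1
    }
}
```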

Display Bug

Java uses UTF-16 internally. So consider the UTF-16 representation:

UTF-16 representation:

D83C DCA1 D86D DF82 0CA0 D83D DE00 00F1
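
To check these code units yourself, you can dump each Java char of the string. This is only a sketch; Utf16Dump and utf16Hex are names invented for this example.

```java
import java.util.StringJoiner;

// Hypothetical helper (name invented for this example).
final class Utf16Dump {
    // Formats each UTF-16 code unit (Java char) of s as four hex digits.
    static String utf16Hex(String s) {
        StringJoiner out = new StringJoiner(" ");
        for (int i = 0; i < s.length(); i++) {
            out.add(String.format("%04X", (int) s.charAt(i)));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String s = new String(new int[]{127137, 178050, 3232, 128512, 241}, 0, 5);
        System.out.println(utf16Hex(s));
        // prints: D83C DCA1 D86D DF82 0CA0 D83D DE00 00F1
    }
}
```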

We can see that the display is showing the five characters you expect, followed by three garbage characters.

So it is clearly trying to display 8 glyphs where there are only five. This is almost certainly because the display code is counting 8 characters: three of the characters are encoded in UTF-16 as surrogate pairs, and so take two 16-bit words each. In other words, it is using the wrong value for the length of the string in the presence of surrogate pairs.
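
The mismatch between the two lengths is easy to see with the standard String methods. The sketch below only illustrates the counting; it is not the JavaFX rendering code, and the class and method names are invented for this example.

```java
// Hypothetical helper (name invented for this example).
final class LengthMismatch {
    // Number of UTF-16 code units, which is what String.length() returns.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points, i.e. the characters a user expects to see.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String s = new String(new int[]{127137, 178050, 3232, 128512, 241}, 0, 5);
        System.out.println("Code units:  " + codeUnits(s));   // 8
        System.out.println("Code points: " + codePoints(s));  // 5
        // For a well-formed string, the difference is the number of surrogate pairs.
        System.out.println("Surrogate pairs: " + (codeUnits(s) - codePoints(s))); // 3
    }
}
```

Code that lays out one glyph per code point but loops up to length() would overshoot by exactly those three positions, which matches the three garbage characters.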

Pasted Text Bug

UTF-8 representation of the test data:

F0 9F 82 A1 F0 AB 9E 82 E0 B2 A0 F0 9F 98 80 C3 B1
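
These bytes can be reproduced by encoding the test string as UTF-8. Again a small sketch, with Utf8Dump and utf8Hex as invented names.

```java
import java.nio.charset.StandardCharsets;
import java.util.StringJoiner;

// Hypothetical helper (name invented for this example).
final class Utf8Dump {
    // Encodes s as UTF-8 and formats each byte as two hex digits.
    static String utf8Hex(String s) {
        StringJoiner out = new StringJoiner(" ");
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            out.add(String.format("%02X", b & 0xFF)); // mask to avoid sign extension
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String s = new String(new int[]{127137, 178050, 3232, 128512, 241}, 0, 5);
        System.out.println(utf8Hex(s));
        // prints: F0 9F 82 A1 F0 AB 9E 82 E0 B2 A0 F0 9F 98 80 C3 B1
    }
}
```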

What is seen is:

00F0 ð LATIN SMALL LETTER ETH 
009F  <control> = APC = APPLICATION PROGRAM COMMAND 
0082  <control> = BPH = BREAK PERMITTED HERE
00A1 ¡ INVERTED EXCLAMATION MARK 
00F0 ð LATIN SMALL LETTER ETH 

(The two control characters can have glyphs in some fonts, containing either their abbreviations or hex codes. These are visible in your example.)

Latin1 hex representation:

F0 9F 82 A1 F0

Note that these five bytes are the same as the first five bytes of the UTF-8 representation of the intended text.

Conclusion: The data was pasted as 5 code points encoded in UTF-8, occupying 17 bytes, but was interpreted as 5 Latin1 code points occupying 5 bytes. Again, the wrong property has been used for the length.
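
The conclusion can be modelled in a few lines. This is a sketch of the hypothesis only, assuming the bytes are decoded as Latin1 and cut off at the code point count; it is not taken from the JavaFX or clipboard source code, and PasteMojibake and simulate are invented names.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical model of the suspected bug (names invented for this example).
final class PasteMojibake {
    // Assumption: the clipboard hands over UTF-8 bytes, the receiver decodes
    // them as Latin1, and then keeps only as many chars as there were code points.
    static String simulate(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        String asLatin1 = new String(utf8, StandardCharsets.ISO_8859_1);
        int expectedLength = s.codePointCount(0, s.length());
        return asLatin1.substring(0, Math.min(expectedLength, asLatin1.length()));
    }

    public static void main(String[] args) {
        String s = new String(new int[]{127137, 178050, 3232, 128512, 241}, 0, 5);
        String garbled = simulate(s);
        // Print the resulting chars as hex, since two of them are control characters.
        for (int i = 0; i < garbled.length(); i++) {
            System.out.printf("%04X ", (int) garbled.charAt(i));
        }
        System.out.println();
        // prints: 00F0 009F 0082 00A1 00F0
    }
}
```

The output matches the five characters listed under "What is seen is", which is consistent with the explanation but does not prove where in the toolkit the wrong length is used.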

This problem has been fixed in Java 10. See the Java bug report.
