Why does Java permit escaped Unicode characters in the source code?

Question

I recently learned that Unicode is permitted within Java source code not only as Unicode characters (e.g. double π = Math.PI;) but also as escaped sequences (e.g. double \u03C0 = Math.PI;).

The first variant makes sense to me - it allows programmers to name variables and methods in an international language of their choice. However, I don't see any practical application of the second approach.

Here are a few pieces of code to illustrate usage, tested with Java SE 6 and NetBeans 6.9.1:

This code will print out 3.141592653589793:

public static void main(String[] args) {
    double π = Math.PI;
    System.out.println(\u03C0);
}

Explanation: π and \u03C0 are the same Unicode character.

This code will not print out anything:

public static void main(String[] args) {
    double π = Math.PI; /\u002A
    System.out.println(π);

    /* a comment */
}

Explanation: The compiler actually reads the code above as:

public static void main(String[] args) {
    double π = Math.PI; /*
    System.out.println(π);

    /* a comment */
}

This comments out the print statement.

Just from my examples, I notice a number of potential problems with this language feature.

First, a bad programmer could use it to secretly comment out bits of code, or create multiple ways of identifying the same variable. Perhaps there are other horrible things that can be done that I haven't thought of.

Second, there seems to be a lack of support among IDEs. Neither NetBeans nor Eclipse provided the correct code highlighting for the examples. In fact, NetBeans even marked a syntax error (though compilation was not a problem).

Finally, this feature is poorly documented and not commonly accepted. Why would a programmer use something in his code that other programmers will not be able to recognize and understand? In fact, I couldn't even find anything about this on the Hidden Java Features question.

My question is this:

Why does Java allow escaped Unicode sequences to be used within syntax? What are some "pros" of this feature that have allowed it to stay a part of Java, despite its many "cons"?

Answer 1

Unicode escape sequences allow you to store and transmit your source code in pure ASCII and still use the entire range of Unicode characters. This has two advantages:

1. No risk of non-ASCII characters getting broken by tools that can't handle them. This was a real concern back in the early 1990s when Java was designed. Sending an email containing non-ASCII characters and having it arrive unmangled was the exception rather than the norm.
2. No need to tell the compiler and editor/IDE which encoding to use for interpreting the source code. This is still a very valid concern. Of course, a much better solution would have been to have the encoding as metadata in a file header (as in XML), but this hadn't yet emerged as a best practice back then.

> The first variant makes sense to me - it allows programmers to name variables and methods in an international language of their choice. However, I don't see any practical application of the second approach.

Both will result in exactly the same byte code and have the same power as a language feature. The only difference is in the source code.
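A small sketch can make this concrete. The class and method names (`PiSpelling`, `twice`) are hypothetical and not from the question; the source is kept deliberately pure ASCII, so the identifier named with U+03C0 is declared and read only through escapes. It assumes nothing beyond the standard Java rule that Unicode escapes are translated before the source is tokenized.

```java
// Hypothetical demo class; every character in this file is 7-bit ASCII.
final class PiSpelling {
    // The parameter is named with U+03C0 (GREEK SMALL LETTER PI), written as an escape.
    static double twice(double \u03C0) {
        // Lower-case hex digits and a repeated 'u' spell the very same identifier.
        return \u03c0 + \uu03C0;
    }

    public static void main(String[] args) {
        System.out.println(twice(Math.PI));
        // String literals go through the same translation step:
        // the literal below holds a single char whose code is 0x03C0.
        System.out.println("\u03C0".length());
        System.out.println((int) "\u03C0".charAt(0));
    }
}
```

Because the translation happens before the compiler sees any tokens, the three spellings inside `twice` cannot be told apart in the resulting class file.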

> First, a bad programmer could use it to secretly comment out bits of code, or create multiple ways of identifying the same variable.

If you're concerned about a programmer deliberately sabotaging your code's readability, this language feature is the least of your problems.

> Second, there seems to be a lack of support among IDEs.

That's hardly the fault of the feature or its designers. But then, I don't think it was ever intended to be used "manually". Ideally, the IDE would have an option to have you enter the characters normally and have them displayed normally, but automatically save them as Unicode escape sequences. There may even already be plugins or configuration options that make the IDEs behave that way.

But in general, this feature seems to be very rarely used and is probably badly supported for that reason. But how could the people who designed Java around 1993 have known that?
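The "save as escapes" step described above could look roughly like the sketch below. `AsciiEscaper` and `toAscii` are hypothetical names, not an existing IDE API; the JDK used to ship a `native2ascii` tool that did a similar conversion. This version is a simplification: it does not treat backslashes already present in the input specially, which a real tool would have to consider.

```java
// Hypothetical helper: rewrites text so that it contains only 7-bit ASCII.
final class AsciiEscaper {
    // Replaces every UTF-16 code unit outside ASCII with a six-character escape.
    static String toAscii(String source) {
        StringBuilder out = new StringBuilder(source.length());
        for (int i = 0; i < source.length(); i++) {
            char c = source.charAt(i);
            if (c < 0x80) {
                out.append(c);                                   // plain ASCII stays as it is
            } else {
                out.append(String.format("\\u%04X", (int) c));   // backslash, 'u', four hex digits
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The literal below contains the real character U+03C0 once compiled.
        String line = "double \u03C0 = Math.PI;";
        // Prints the same declaration with the character spelled as an escape again.
        System.out.println(toAscii(line));
    }
}
```

Working on `char` values rather than code points is intentional here: a character outside the Basic Multilingual Plane is written in Java source as two escapes, one per surrogate.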

Answer 2

The nice thing about the \u03C0 encoding is that it is much less likely to be munged by a text editor with the wrong encoding settings. For example, a bug in my software was caused by the accidental transformation of a UTF-8 é into a MacRoman é by a misconfigured text editor. By specifying the Unicode codepoint, it's completely unambiguous what you mean.
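The kind of damage described here can be reproduced in a few lines. This is only a sketch: `MojibakeDemo` and `misread` are hypothetical names, and ISO-8859-1 stands in for MacRoman as the wrong charset, because it is guaranteed to be available on every Java platform. The mechanism (UTF-8 bytes decoded with a single-byte charset) is the same.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo of text being read back with the wrong encoding.
final class MojibakeDemo {
    // Encodes the text as UTF-8, then decodes those bytes as ISO-8859-1
    // (an assumption standing in for the MacRoman case in the answer).
    static String misread(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String eAcute = "\u00E9";            // LATIN SMALL LETTER E WITH ACUTE
        String garbled = misread(eAcute);
        System.out.println(eAcute.length()); // one char before the round trip
        System.out.println(garbled.length()); // two chars after it
        for (char c : garbled.toCharArray()) {
            System.out.printf("U+%04X%n", (int) c); // the two wrong characters
        }
    }
}
```

The escape in the source survives any such round trip, because every byte of it is plain ASCII.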

Answer 3

The \uXXXX syntax allows Unicode characters to be represented unambiguously in a file whose encoding cannot express them directly, or when you want a representation guaranteed to be usable even in the lowest common denominator, namely a 7-bit ASCII encoding.

You could represent all your characters with \uXXXX, even spaces and letters, but there is rarely a need to.
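To illustrate how far this goes, here is a sketch in which a keyword, the space after it, and a number are all written as escapes. `FullyEscaped` and `answer` are hypothetical names chosen for the example; nobody would write real code this way.

```java
// Hypothetical demo: escapes are legal anywhere in the source, not just in names.
final class FullyEscaped {
    static int answer() {
        // The next line reads "int x = 42;" once the escapes are translated:
        // 0069 006E 0074 are the letters i, n, t; 0020 is a space; 0034 0032 are 4 and 2.
        \u0069\u006E\u0074\u0020x = \u0034\u0032;
        return x;
    }

    public static void main(String[] args) {
        System.out.println(answer());
    }
}
```

This works because the escapes are replaced before the compiler splits the text into keywords, identifiers, and literals.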

Answer 4

First, thank you for the question; I think it is very interesting. Second, the reason is that a Java source file is text that can itself be stored in various charsets. For example, the default charset in Eclipse is Cp1255. This encoding does not support characters like π. I think the designers had in mind programmers who have to work on systems that do not support Unicode, and wanted to allow these programmers to create Unicode-enabled software. This was the reason to support the \u notation.

4 answers

Answer 1: score 31, accepted, 2010-12-15 09:21:18
Answer 2: score 8, 2010-12-15 08:54:52
Answer 3: score 3, 2010-12-15 09:37:24
Answer 4: score 2, 2010-12-15 08:58:17
