Why does the C# Unicode range cover only a limited range (up to 0xFFFF)?

I'm getting confused about C# UTF-8 encoding...

Assuming these "facts" are right:

  1. Unicode is the "protocol" which defines each character.
  2. UTF-8 defines the "implementation": how to store those characters.
  3. Unicode defines a character range from 0x0000 to 0x10FFFF (source).

According to the C# reference, the accepted range for each char is 0x0000 to 0xFFFF. I don't understand what happens to the other characters, which are above 0xFFFF and are defined in the Unicode protocol.

In contrast to C#, when I use Python to write UTF-8 text, it covers the whole expected range (0x0000 to 0x10FFFF). For example:

u"\U00010000"  #WORKING!!!

which isn't working for C#. What's more, when I write the string u"\U00010000" (a single character) to a text file in Python and then read it from C#, this single-character document becomes 2 characters in C#!

# Python (write):
import codecs

text = u"\U00010000"  # a single character, U+10000
with codecs.open("file.txt", "w+", encoding="utf-8") as f:
    f.write(text)  # len(text) -> 1

// C# (read; needs using System, System.IO and System.Text):
string text = File.ReadAllText("file.txt", Encoding.UTF8); // How I read this text from the file.
Console.WriteLine(text.Length); // 2

Why? How to fix?

According to the C# reference, the accepted range for each char is 0x0000 to 0xFFFF. I don't understand what happens to the other characters, which are above 0xFFFF and are defined in the Unicode protocol.

Unfortunately, a C#/.NET char does not represent a Unicode character.

A char is a 16-bit value in the range 0x0000 to 0xFFFF which represents one "UTF-16 code unit". Characters in the ranges U+0000–U+D7FF and U+E000–U+FFFF are represented by the code unit of the same number, so everything's fine there.
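To make the difference between code points and UTF-16 code units concrete, here is a small Python sketch. The helper name `utf16_code_units` is my own, not a library function; it counts roughly what a .NET `string.Length` would report for the same text, assuming the text contains no lone surrogates.

```python
def utf16_code_units(text: str) -> int:
    """Number of 16-bit code units needed to store `text` as UTF-16."""
    # utf-16-le emits no byte order mark, so every 2 bytes is one code unit.
    return len(text.encode("utf-16-le")) // 2


bmp = "\u00e9"          # U+00E9, inside the range a single code unit can hold
astral = "\U00010000"   # U+10000, the first code point above U+FFFF

# Python 3 counts code points; UTF-16 based strings count code units.
print(len(bmp), utf16_code_units(bmp))        # 1 1
print(len(astral), utf16_code_units(astral))  # 1 2
```

The second line of output is the 1-versus-2 mismatch from the question: one code point, two code units.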

The less-often-used other characters, in the range U+010000 to U+10FFFF, are squashed into the remaining space 0xD800–0xDFFF by representing each character as two UTF-16 code units together, so the equivalent of the Python string "\U00010000" is C# "\uD800\uDC00".
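The mapping from a code point above U+FFFF to its two code units is plain bit arithmetic. A minimal sketch follows; the function name `to_surrogate_pair` is hypothetical and chosen only for illustration.

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a code point in U+10000..U+10FFFF into (high, low) UTF-16 code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000       # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits -> 0xD800..0xDBFF
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits -> 0xDC00..0xDFFF
    return high, low


high, low = to_surrogate_pair(0x10000)
print(f"\\u{high:04X}\\u{low:04X}")  # \uD800\uDC00
```

Each half carries 10 bits, which is why the scheme tops out at 0x10000 + 2^20 - 1 = 0x10FFFF.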

Why?

The reason for this craziness is that the Windows NT series itself uses UTF-16LE as the native string encoding, so for interoperability convenience .NET chose the same. WinNT chose that encoding (at the time thought of as UCS-2, and without any of the pesky surrogate code unit pairs) because in the early days Unicode only had characters up to U+FFFF, and the thinking was that this was going to be all anyone would ever need.

How to fix?

There isn't really a good fix. Some other languages that were unfortunate enough to have based their string type on UTF-16 code units (Java, JavaScript) are starting to add methods to their strings that operate on them one code point at a time, but .NET had no such functionality on its string type when this was written. (Later .NET versions added System.Text.Rune and string.EnumerateRunes() for this purpose.)

Often you don't actually need to count/find/split/order/etc. strings using proper code point items and indexes. But when you really, really do, in .NET, you're in for a bad time. You end up having to re-implement each normally-trivial method by manually walking over each char and checking whether it is part of a two-char surrogate pair, or by converting the string to an array of code point ints and back. This isn't a lot of fun, either way.
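The "walk over each char and check for surrogate pairs" approach looks roughly like this. It is sketched in Python over a list of 16-bit integers standing in for a .NET char array; the function name `code_points_from_utf16` is hypothetical, and passing lone surrogates through unchanged is my own choice rather than anything the platform mandates.

```python
from typing import Iterable, List


def code_points_from_utf16(units: Iterable[int]) -> List[int]:
    """Combine UTF-16 code units into code points, pairing up surrogates."""
    result: List[int] = []
    pending_high = None  # a high surrogate waiting for its low partner
    for unit in units:
        if pending_high is not None:
            if 0xDC00 <= unit <= 0xDFFF:
                # Valid pair: rebuild the 20-bit offset and add 0x10000.
                offset = ((pending_high - 0xD800) << 10) + (unit - 0xDC00)
                result.append(0x10000 + offset)
                pending_high = None
                continue
            # Lone high surrogate: pass it through unchanged.
            result.append(pending_high)
            pending_high = None
        if 0xD800 <= unit <= 0xDBFF:
            pending_high = unit
        else:
            result.append(unit)
    if pending_high is not None:
        result.append(pending_high)  # trailing lone high surrogate
    return result


units = [0x0041, 0xD800, 0xDC00, 0x0042]  # "A", U+10000, "B" as UTF-16
print([hex(cp) for cp in code_points_from_utf16(units)])  # ['0x41', '0x10000', '0x42']
```

Four code units become three code points, so any count, index, or split logic built on the result works per character rather than per code unit.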

A more elegant and altogether more practical option is to invent a time machine, so we can send the UTF-8 design back to 1988 and prevent UTF-16 from ever having existed.

Unicode has so-called planes (wiki).

As you can see, C#'s char type only supports the first plane, plane 0, the Basic Multilingual Plane.
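Since each plane holds 0x10000 code points, the plane number is simply the code point shifted right by 16 bits. A quick sketch (the helper `plane_of` is a made-up name for illustration):

```python
def plane_of(code_point: int) -> int:
    """Return the Unicode plane (0..16) that contains `code_point`."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    return code_point >> 16  # each plane spans 0x10000 code points


for cp in (0x0041, 0xFFFF, 0x10000, 0x10FFFF):
    print(f"U+{cp:04X} is in plane {plane_of(cp)}")
```

Anything that reports a plane other than 0 does not fit in a single 16-bit char.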

I know for a fact that C# uses UTF-16 encoding, so I'm a bit surprised to see that it doesn't support code points beyond the first plane in the char datatype (I haven't run into this issue myself).

This is an artificial restriction in char's implementation, but one that's understandable. The designers of .NET probably didn't want to tie the abstraction of their own character datatype to the abstraction that Unicode defines, in case that standard would not survive (it had already superseded others). This is just my guess, of course. It just "uses" UTF-16 for memory representation.

UTF-16 uses a trick to represent code points higher than 0xFFFF with 16-bit code units. Technically those code points consist of 2 "characters", the so-called surrogate pair. In that sense it breaks the "one code point = one character" abstraction.

You can definitely get around this by working with string and maybe arrays of char. If you have more specific problems, you can find plenty of information on StackOverflow and elsewhere about working with all of Unicode's code points in .NET.
