How do I get the decimal value of a Unicode character in C#?

How do I get the numeric value of a Unicode character in C#?

For example, if the Tamil character (U+0B85) is given, the output should be 2949 (i.e. 0x0B85).

Multi code-point characters

Some characters require multiple code points. In this example, in UTF-16, each code unit is still in the Basic Multilingual Plane:

  • [image] (i.e. U+0072 U+0327 U+030C)
  • [image] (i.e. U+0072 U+0338 U+0327 U+0316 U+0317 U+0300 U+0301 U+0302 U+0308 U+0360)

The larger point being that one "character" can require more than 1 UTF-16 code unit, it can require more than 2 UTF-16 code units, it can require more than 3 UTF-16 code units.

The larger point being that one "character" can require dozens of Unicode code points. In UTF-16 in C# that means more than 1 char. One character can require 17 char.

My question was about converting a char into a UTF-16 encoding value. Even if an entire string of 17 char only represents one "character", I still want to know how to convert each UTF-16 unit into a numeric value.

e.g.

String s = "அ";

int i = Unicode(s[0]);

Where Unicode returns the integer value, as defined by the Unicode standard, for the first character of the input expression.

It's basically the same as Java. If you've got it as a char, you can just convert to int implicitly:

char c = '\u0b85';

// Implicit conversion: char is basically a 16-bit unsigned integer
int x = c;
Console.WriteLine(x); // Prints 2949

If you've got it as part of a string, just get that single character first:

string text = GetText();
int x = text[2]; // Or whatever...

Note that characters not in the Basic Multilingual Plane are represented as two UTF-16 code units. There is support in .NET for finding the full Unicode code point, but it's not simple.
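For what it's worth, a minimal sketch of that .NET support, assuming char.ConvertToUtf32 covers the case at hand. The class and method names (CodePointDemo, CodePointAt) are made up for illustration:

```csharp
using System;

static class CodePointDemo
{
    // Returns the full Unicode code point that starts at s[index].
    // For a BMP character this is the char's own value; for a surrogate
    // pair, char.ConvertToUtf32 combines the two UTF-16 code units.
    public static int CodePointAt(string s, int index)
    {
        return char.ConvertToUtf32(s, index);
    }

    static void Main()
    {
        Console.WriteLine(CodePointAt("\u0B85", 0));     // 2949
        Console.WriteLine(CodePointAt("\U00013000", 0)); // 77824
    }
}
```

ConvertToUtf32 throws an ArgumentException if the index points at an unpaired surrogate, so callers that walk a string should skip the second half of each pair.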

((int)'அ').ToString()

If you have the character as a char, you can cast that to an int, which will represent the character's numeric value. You can then print that out in any way you like, just like with any other integer.

If you wanted hexadecimal output instead, you can use:

((int)'அ').ToString("X4")

X is for hexadecimal, 4 is for zero-padding to four characters.
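Putting the two format calls together, here is a small sketch that also builds the usual U+XXXX notation. FormatDemo and ToUPlus are hypothetical names, not part of .NET:

```csharp
using System;

static class FormatDemo
{
    // Formats a char's numeric value in the conventional U+XXXX notation.
    public static string ToUPlus(char c)
    {
        return "U+" + ((int)c).ToString("X4");
    }

    static void Main()
    {
        Console.WriteLine(((int)'அ').ToString());     // 2949
        Console.WriteLine(((int)'அ').ToString("X4")); // 0B85
        Console.WriteLine(ToUPlus('அ'));              // U+0B85
    }
}
```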

How do I get the numeric value of a Unicode character in C#?

A char is not necessarily the whole Unicode code point. In UTF-16 encoded languages such as C#, you may actually need 2 chars to represent a single "logical" character. And your string lengths might not be what you expect. The MSDN documentation for the String.Length property says:

"The Length property returns the number of Char objects in this instance, not the number of Unicode characters."
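A quick way to see that difference, sketched with StringInfo from System.Globalization. The helper names here are made up for the example:

```csharp
using System;
using System.Globalization;

static class LengthDemo
{
    // Number of UTF-16 code units, which is what String.Length reports.
    public static int CodeUnits(string s)
    {
        return s.Length;
    }

    // Number of text elements as counted by StringInfo.
    public static int TextElements(string s)
    {
        return new StringInfo(s).LengthInTextElements;
    }

    static void Main()
    {
        string hieroglyph = "\U00013000"; // one code point outside the BMP
        Console.WriteLine(CodeUnits(hieroglyph));    // 2
        Console.WriteLine(TextElements(hieroglyph)); // 1
    }
}
```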

  • So, if your Unicode character is encoded in just one char, it is already numeric (essentially an unsigned 16-bit integer). You may want to cast it to one of the integer types, but this won't change the actual bits that were originally present in the char.
  • If your Unicode character is 2 chars, you'll need to multiply one by 2^16 and add it to the other, resulting in a uint numeric value (this packs the two code units into one number, which is not the same as the Unicode code point):

    char c1 = ...;
    char c2 = ...;
    uint c = ((uint)c1 << 16) | c2;
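For comparison, here is a sketch that puts the packing above next to the standard UTF-16 surrogate arithmetic, which recovers the actual code point. SurrogateMath, Pack and Decode are illustrative names only:

```csharp
using System;

static class SurrogateMath
{
    // Packs two UTF-16 code units into one 32-bit value (raw code units).
    public static uint Pack(char c1, char c2)
    {
        return ((uint)c1 << 16) | c2;
    }

    // Decodes a high/low surrogate pair into the code point it represents.
    public static int Decode(char high, char low)
    {
        if (!char.IsSurrogatePair(high, low))
            throw new ArgumentException("Not a valid surrogate pair.");
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    static void Main()
    {
        string s = "\U00013000"; // stored as the pair D80C DC00
        Console.WriteLine(Pack(s[0], s[1]).ToString("X8")); // D80CDC00
        Console.WriteLine(Decode(s[0], s[1]));              // 77824
    }
}
```

The packed value is fine as a unique number for the pair, but only the decoded value matches what the Unicode charts list for the character.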

How do I get the decimal value of a Unicode character in C#?

When you say "decimal", this usually means a character string containing only characters that a human being would interpret as decimal digits.

  • If you can represent your Unicode character by only one char, you can convert it to a decimal string simply by:

    char c = 'அ';
    string s = ((ushort)c).ToString();

  • If you have 2 chars for your Unicode character, convert them to a uint as described above, then call uint.ToString.

--- EDIT ---

AFAIK diacritical marks are considered separate "characters" (and separate code points) despite being visually rendered together with the "base" character. Each of these code points taken alone is still at most 2 UTF-16 code units.

BTW I think the proper name for what you are talking about is not "character" but "combining character". So yes, a single combining character can have more than 1 code point and therefore more than 2 code units. If you want a decimal representation of such a combining character, you can probably do it most easily through BigInteger:

string c = "\x0072\x0338\x0327\x0316\x0317\x0300\x0301\x0302\x0308\x0360";
string s = (new BigInteger(Encoding.Unicode.GetBytes(c))).ToString();

Depending on what order of significance of the code unit "digits" you wish, you may want to reverse c.
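If what you're after is the numeric value of each UTF-16 code unit separately, as the question puts it, a plain loop may be simpler than BigInteger. A sketch, with CodeUnitDemo and CodeUnitValues as hypothetical names:

```csharp
using System;
using System.Linq;

static class CodeUnitDemo
{
    // Converts every UTF-16 code unit of s to its integer value.
    public static int[] CodeUnitValues(string s)
    {
        return s.Select(ch => (int)ch).ToArray();
    }

    static void Main()
    {
        string c = "\u0072\u0327\u030C"; // r + combining cedilla + combining caron
        Console.WriteLine(string.Join(" ", CodeUnitValues(c))); // 114 807 780
    }
}
```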

This is an example of using Plane 1, the Supplementary Multilingual Plane (SMP):

string single_character = "\U00013000"; //first Egyptian ancient hieroglyph in hex
//it is encoded as 4 bytes (instead of 2)

//get the Unicode index using UTF32 (4 bytes fixed encoding)
Encoding enc = new UTF32Encoding(false, true, true);
byte[] b = enc.GetBytes(single_character);
Int32 code = BitConverter.ToInt32(b, 0); //in decimal

char c = 'அ';
short code = (short)c;
ushort code2 = (ushort)c;
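One caveat on the two casts: they agree for அ, but short is signed, so any char above 0x7FFF comes out negative. A small sketch (CastDemo and its methods are invented for the example; unchecked is spelled out so the wrap-around is explicit):

```csharp
using System;

static class CastDemo
{
    // Reinterprets the 16 bits as a signed value; wraps for chars above 0x7FFF.
    public static short AsShort(char c)
    {
        return unchecked((short)c);
    }

    // Keeps the 16 bits as an unsigned value; never negative.
    public static ushort AsUShort(char c)
    {
        return (ushort)c;
    }

    static void Main()
    {
        char low = '\u0B85';  // below 0x8000
        char high = '\uFFFD'; // above 0x7FFF
        Console.WriteLine(AsShort(low));   // 2949
        Console.WriteLine(AsUShort(low));  // 2949
        Console.WriteLine(AsShort(high));  // -3
        Console.WriteLine(AsUShort(high)); // 65533
    }
}
```

So ushort (or int) is probably the safer target if the value is meant to match the code unit.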
