
How to retrieve the Unicode decimal representation of the chars in a string containing Hindi text?

I am using Visual Studio 2010 with C# to convert text into Unicode values. For example, I have a string abc = "मेरा". There are 4 characters in this string, and I need the Unicode value of all four. Please help me.

Since a .NET char is a Unicode character (at least for code points in the BMP), you can simply enumerate all characters in a string:

var abc = "मेरा";

foreach (var c in abc)
{
    Console.WriteLine((int)c);
}

resulting in

2350
2375
2352
2366

When you write code like string abc = "मेरा"; you already have the text as Unicode (specifically, UTF-16), so you don't have to convert anything. If you want to access the individual characters, you can do that with a normal index: e.g. abc[1] is े (DEVANAGARI VOWEL SIGN E).

If you want to see the numeric representations of those characters, just cast them to integers. For example

abc.Select(c => (int)c)

gives the sequence of numbers 2350, 2375, 2352, 2366. If you want to see the hexadecimal representation of those numbers, use ToString() with a format string:

abc.Select(c => ((int)c).ToString("x4"))

returns the sequence of strings "092e", "0947", "0930", "093e".

Note that when I said numeric representations, I actually meant their encoding in UTF-16. For characters in the Basic Multilingual Plane (BMP), this is the same as their Unicode code point. The vast majority of characters in use lie in the BMP, including the 4 Hindi characters presented here.

If you wanted to handle characters in other planes too, you could use code like the following.

byte[] bytes = Encoding.UTF32.GetBytes(abc);

int codePointCount = bytes.Length / 4;

int[] codePoints = new int[codePointCount];

for (int i = 0; i < codePointCount; i++)
    codePoints[i] = BitConverter.ToInt32(bytes, i * 4);

Since UTF-32 encodes all (21-bit) code points directly, this will give you them. (Maybe there is a more straightforward solution, but I haven't found one.)
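A possibly more direct route, offered as a sketch rather than as the answerer's method: char.ConvertToUtf32(string, int) returns the code point that starts at a given index and combines a surrogate pair into one value. The class and method names below (CodePointDemo, GetCodePoints) are made up for illustration, and the loop assumes the string is well-formed UTF-16, since ConvertToUtf32 throws on an unpaired surrogate.

```csharp
using System;
using System.Collections.Generic;

static class CodePointDemo
{
    // Hypothetical helper: returns one int per Unicode code point in the string.
    // Assumes well-formed UTF-16 (no unpaired surrogates).
    public static List<int> GetCodePoints(string text)
    {
        var result = new List<int>();
        for (int i = 0; i < text.Length; i++)
        {
            // Reads the code point starting at index i; a surrogate pair becomes one value.
            result.Add(char.ConvertToUtf32(text, i));

            // A high surrogate means the code point used two chars, so skip the low surrogate.
            if (char.IsHighSurrogate(text[i]))
                i++;
        }
        return result;
    }

    public static void Main()
    {
        // Prints 2350, 2375, 2352, 2366, one per line.
        foreach (int codePoint in GetCodePoints("मेरा"))
            Console.WriteLine(codePoint);
    }
}
```

For text entirely in the BMP this gives the same numbers as the plain (int)c cast; the difference only shows up for characters outside the BMP, which occupy two chars in a .NET string.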

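Use

System.Text.Encoding.UTF8.GetBytes(abc)

This returns the UTF-8 encoded bytes of the string. Note that these are not the Unicode code point values: each of the four Devanagari characters here takes three bytes in UTF-8, so the result is a 12-byte array.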
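To see what this call yields for the string in the question, here is a small sketch (the class name Utf8BytesDemo and the helper ToHex are invented for the example). It prints the byte count and the bytes in hexadecimal, which makes it easy to compare them with the code point values 2350, 2375, 2352, 2366.

```csharp
using System;
using System.Text;

static class Utf8BytesDemo
{
    // Hypothetical helper: UTF-8 bytes of the text as hyphen-separated uppercase hex.
    public static string ToHex(string text)
    {
        return BitConverter.ToString(Encoding.UTF8.GetBytes(text));
    }

    public static void Main()
    {
        string abc = "मेरा";

        // Each of these Devanagari characters needs three bytes in UTF-8.
        Console.WriteLine(Encoding.UTF8.GetBytes(abc).Length); // 12

        Console.WriteLine(ToHex(abc)); // E0-A4-AE-E0-A5-87-E0-A4-B0-E0-A4-BE
    }
}
```

The bytes are an encoded form of the code points, not the code points themselves, so decimal values such as 2350 do not appear in this array.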

If you are trying to convert files from a legacy encoding into Unicode:

Read the file, supplying the correct encoding of the source files, then write the file using the desired Unicode encoding scheme.

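    // "x-iscii-de" (code page 57002) is the .NET name for ISCII Devanagari; "ISCII" alone is not a registered encoding name.
    using (StreamReader reader = new StreamReader(@"C:\MyFile.txt", Encoding.GetEncoding("x-iscii-de")))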
    using (StreamWriter writer = new StreamWriter(@"C:\MyConvertedFile.txt", false, Encoding.UTF8))
    {
        writer.Write(reader.ReadToEnd());
    }

If you are looking for a mapping of Devanagari characters to Unicode code points:

You can find the Devanagari code chart on the Unicode Consortium website.

Note that Unicode code points are traditionally written in hexadecimal. So rather than the decimal number 2350, the code point would be written as U+092E, and it appears as 092E on the code chart.
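As an illustration of that notation, the sketch below prints each character of the question's string as both a decimal value and a U+ form. The class and method names (CodePointNotation, ToUPlus) are hypothetical; the only library feature relied on is the "X4" format string, which produces at least four uppercase hex digits.

```csharp
using System;

static class CodePointNotation
{
    // Hypothetical helper: formats a code point as "U+" followed by
    // at least four uppercase hexadecimal digits.
    public static string ToUPlus(int codePoint)
    {
        return "U+" + codePoint.ToString("X4");
    }

    public static void Main()
    {
        // All four characters are in the BMP, so each char is one code point.
        foreach (char c in "मेरा")
            Console.WriteLine("{0} = {1}", (int)c, ToUPlus(c));
        // Prints:
        // 2350 = U+092E
        // 2375 = U+0947
        // 2352 = U+0930
        // 2366 = U+093E
    }
}
```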

If you have the string s = "मेरा" then you already have the answer.

This string contains four code points in the BMP, which in UTF-16 are represented by 8 bytes. You can access them by index with s[i], with a foreach loop, etc.

If you want the underlying 8 bytes, you can access them like so:

string str = @"मेरा";
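// Encoding.Unicode is UTF-16 little-endian; GetBytes is an instance method, so it is called on that instance.
byte[] arr = System.Text.Encoding.Unicode.GetBytes(str);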
