How to retrieve the Unicode decimal representation of the chars in a string containing Hindi text?
I am using C# in Visual Studio 2010 to convert text into Unicode values. For example, I have a string abc = "मेरा". There are 4 characters in this string, and I need the Unicode value of all four. Please help me.
Since a .NET char is a UTF-16 code unit, which for characters in the BMP is the same as the Unicode code point, you can simply enumerate all characters in a string:
var abc = "मेरा";
foreach (var c in abc)
{
    Console.WriteLine((int)c);
}
resulting in:
2350
2375
2352
2366
When you write code like string abc = "मेरा"; you already have it as Unicode (specifically, UTF-16), so you don't have to convert anything. If you want to access individual characters, you can do that using a normal index: e.g. abc[1] is े (DEVANAGARI VOWEL SIGN E).
If you want to see the numeric representations of those characters, just cast them to integers. For example,
abc.Select(c => (int)c)
gives the sequence of numbers 2350, 2375, 2352, 2366. If you want to see the hexadecimal representation of those numbers, use ToString():
abc.Select(c => ((int)c).ToString("x4"))
returns the sequence of strings "092e", "0947", "0930", "093e".
Note that when I said numeric representations, I actually meant their encoding using UTF-16. For characters in the Basic Multilingual Plane (BMP), this is the same as their Unicode code point. The vast majority of characters in use lie in the BMP, including the 4 Hindi characters presented here.
If you wanted to handle characters in other planes too, you could use code like the following.
// Requires: using System; using System.Text;
// UTF-32 stores each code point in exactly 4 bytes.
byte[] bytes = Encoding.UTF32.GetBytes(abc);
int codePointCount = bytes.Length / 4;
int[] codePoints = new int[codePointCount];
for (int i = 0; i < codePointCount; i++)
    codePoints[i] = BitConverter.ToInt32(bytes, i * 4);
Since UTF-32 encodes every (21-bit) code point directly, this gives you the code points. (Maybe there is a more straightforward solution, but I haven't found one.)
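One possibly more direct route, offered here as a sketch rather than as the answerer's own method, is to walk the string and let char.ConvertToUtf32 combine surrogate pairs. The class and method names (CodePointDemo, GetCodePoints) are made up for illustration, and the sketch assumes the input contains no unpaired surrogates, since ConvertToUtf32 throws an ArgumentException on those.

```csharp
using System;
using System.Collections.Generic;

static class CodePointDemo
{
    // Returns the Unicode code points of a string.
    // Assumption: the text has no unpaired surrogates.
    public static List<int> GetCodePoints(string text)
    {
        var codePoints = new List<int>();
        for (int i = 0; i < text.Length; i++)
        {
            // Reads one code point starting at index i (one or two chars).
            codePoints.Add(char.ConvertToUtf32(text, i));

            // A surrogate pair occupies two chars, so skip the second one.
            if (char.IsSurrogatePair(text, i))
                i++;
        }
        return codePoints;
    }

    static void Main()
    {
        // Prints 2350, 2375, 2352, 2366 on separate lines.
        foreach (int cp in GetCodePoints("मेरा"))
            Console.WriteLine(cp);
    }
}
```

For BMP-only text such as the Hindi string in the question, this gives the same numbers as casting each char to int; the difference only shows up for characters outside the BMP.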
Use
System.Text.Encoding.UTF8.GetBytes(abc)
Note that this returns the UTF-8 encoded bytes of the string (three bytes per Devanagari character), not the Unicode code points themselves.
If you are trying to convert files from a legacy encoding into Unicode:
Read the file, supplying the correct encoding of the source files, then write the file using the desired Unicode encoding scheme.
// "x-iscii-de" (code page 57002) is the .NET name for ISCII Devanagari.
using (StreamReader reader = new StreamReader(@"C:\MyFile.txt", Encoding.GetEncoding("x-iscii-de")))
using (StreamWriter writer = new StreamWriter(@"C:\MyConvertedFile.txt", false, Encoding.UTF8))
{
    writer.Write(reader.ReadToEnd());
}
If you are looking for a mapping of Devanagari characters to the Unicode code points:
You can find the Devanagari code chart on the Unicode Consortium website.
Note that Unicode code points are traditionally written in hexadecimal. So rather than the decimal number 2350, the code point would be written as U+092E, and it appears as 092E on the code chart.
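To line the decimal values up with the code chart, a small helper can print each character in U+XXXX form. This is only a sketch; the names CodePointNotation and ToUPlus are hypothetical and not part of any library.

```csharp
using System;

static class CodePointNotation
{
    // Formats a code point in the conventional U+XXXX notation
    // (uppercase hex, at least four digits).
    public static string ToUPlus(int codePoint)
    {
        return "U+" + codePoint.ToString("X4");
    }

    static void Main()
    {
        // Each line shows the decimal value next to its U+ form,
        // e.g. "2350 = U+092E" for the first character.
        foreach (char c in "मेरा")
            Console.WriteLine("{0} = {1}", (int)c, ToUPlus(c));
    }
}
```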
If you have the string s = "मेरा" then you already have the answer.
This string contains four code points in the BMP, which in UTF-16 are represented by 8 bytes. You can access them by index with s[i], with a foreach loop, etc.
If you want the underlying 8 bytes, you can access them like so:
string str = @"मेरा";
// GetBytes is an instance method; Encoding.Unicode is the UTF-16 (little-endian) encoding.
byte[] arr = System.Text.Encoding.Unicode.GetBytes(str);
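To see what those 8 bytes look like, the following sketch dumps them as hex. It assumes Encoding.Unicode (UTF-16, little-endian), so the low byte of each code unit comes first; the names Utf16BytesDemo and ToHex are illustrative only.

```csharp
using System;
using System.Text;

static class Utf16BytesDemo
{
    // Returns the UTF-16 (little-endian) bytes of the text as a
    // dash-separated hex string.
    public static string ToHex(string text)
    {
        byte[] bytes = Encoding.Unicode.GetBytes(text);
        return BitConverter.ToString(bytes);
    }

    static void Main()
    {
        // Prints 2E-09-47-09-30-09-3E-09
        Console.WriteLine(ToHex("मेरा"));
    }
}
```

Reading the bytes in pairs and swapping each pair gives 092E, 0947, 0930, 093E, the same values as casting the chars to integers.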
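Notice: The technical posts on this site are licensed under CC BY-SA 4.0. If you repost them, please cite this site's URL or the original address. For any questions, contact: yoyou2525@163.com.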