
How to retrieve the Unicode decimal representation of the chars in a string containing Hindi text?

I am using Visual Studio 2010 with C# to convert text into its Unicode values. For example, I have a string abc = "मेरा". There are 4 characters in this string, and I need the Unicode value of each of the four characters. Please help me.

Since a .NET char is a UTF-16 code unit (which, for characters in the BMP, is the same as the Unicode code point), you can simply enumerate all characters in a string:

var abc = "मेरा";

foreach (var c in abc)
{
    Console.WriteLine((int)c);
}

resulting in

2350
2375
2352
2366

When you write code like string abc = "मेरा";, you already have it as Unicode (specifically, UTF-16), so you don't have to convert anything. If you want to access the individual characters, you can do that using an ordinary index: e.g. abc[1] is U+0947 (DEVANAGARI VOWEL SIGN E).

If you want to see the numeric representations of those characters, just cast them to integers. For example

abc.Select(c => (int)c)

gives the sequence of numbers 2350, 2375, 2352, 2366 (Select requires using System.Linq). If you want to see the hexadecimal representation of those numbers, use ToString() with a hex format string:

abc.Select(c => ((int)c).ToString("x4"))

returns the sequence of strings "092e", "0947", "0930", "093e".

Note that when I said numeric representations, I actually meant their encoding in UTF-16. For characters in the Basic Multilingual Plane (BMP), this is the same as their Unicode code point. The vast majority of characters in common use lie in the BMP, including the 4 Hindi characters presented here.

If you wanted to handle characters in other planes too, you could use code like the following.

byte[] bytes = Encoding.UTF32.GetBytes(abc);

int codePointCount = bytes.Length / 4;

int[] codePoints = new int[codePointCount];

for (int i = 0; i < codePointCount; i++)
    codePoints[i] = BitConverter.ToInt32(bytes, i * 4);

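Since UTF-32 encodes every (21-bit) code point directly as one 32-bit value, this gives you the code points themselves. Note that Encoding.UTF32 is little-endian while BitConverter uses the machine's byte order, so the code above assumes a little-endian machine. (Maybe there is a more straightforward solution, but I haven't found one.)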
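One possibly more direct route is char.ConvertToUtf32, which reads a single code point from a string and combines a surrogate pair when it finds one. The following is only a minimal sketch: the class and method names (CodePointDemo, GetCodePoints) are made up for illustration, and it assumes the input is well-formed UTF-16 (an unpaired surrogate makes ConvertToUtf32 throw an ArgumentException).

```csharp
using System;
using System.Collections.Generic;

public static class CodePointDemo
{
    // Hypothetical helper: returns the Unicode code points of a string,
    // treating each surrogate pair as one code point.
    public static List<int> GetCodePoints(string text)
    {
        var codePoints = new List<int>();
        for (int i = 0; i < text.Length; i++)
        {
            // Reads one code point starting at index i.
            codePoints.Add(char.ConvertToUtf32(text, i));

            // A code point outside the BMP occupies two chars,
            // so skip the low surrogate that follows.
            if (char.IsHighSurrogate(text[i]))
            {
                i++;
            }
        }
        return codePoints;
    }

    public static void Main()
    {
        foreach (int cp in GetCodePoints("मेरा"))
        {
            // Prints the decimal value followed by the U+XXXX form.
            Console.WriteLine("{0} (U+{0:X4})", cp);
        }
    }
}
```

For the string in the question this prints the same four values as the simple cast (2350, 2375, 2352, 2366); the difference only shows up for characters outside the BMP, where one code point is stored as two chars.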

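Use

System.Text.Encoding.UTF8.GetBytes(abc)

Note that this returns the UTF-8 encoded bytes of the string (12 bytes here, three per character), not the Unicode code point values themselves; to get the code points, cast each char to int as shown above.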
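To see what GetBytes actually produces for this string, here is a small sketch that dumps the UTF-8 bytes as hex. The class and method names (Utf8Demo, Utf8Hex) are placeholders of mine, not part of any answer above.

```csharp
using System;
using System.Text;

public static class Utf8Demo
{
    // Hypothetical helper: returns the UTF-8 bytes of the text
    // as a hyphen-separated hex string.
    public static string Utf8Hex(string text)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(text);
        return BitConverter.ToString(bytes);
    }

    public static void Main()
    {
        // Each of these Devanagari characters takes three bytes in UTF-8.
        Console.WriteLine(Utf8Hex("मेरा"));
        // prints E0-A4-AE-E0-A5-87-E0-A4-B0-E0-A4-BE
    }
}
```

The bytes E0 A4 AE are the UTF-8 encoding of U+092E, so the result is an encoded form of the code points rather than the values 2350, 2375, 2352, 2366.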

If you are trying to convert files from a legacy encoding into Unicode:

Read the file, supplying the correct encoding of the source files, then write the file using the desired Unicode encoding scheme.

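    // Assumes the source file is ISCII Devanagari, which the .NET Framework
    // exposes under the name "x-iscii-de" (code page 57002).
    using (StreamReader reader = new StreamReader(@"C:\MyFile.txt", Encoding.GetEncoding("x-iscii-de")))
    using (StreamWriter writer = new StreamWriter(@"C:\MyConvertedFile.txt", false, Encoding.UTF8))
    {
        writer.Write(reader.ReadToEnd());
    }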

If you are looking for a mapping of Devanagari characters to the Unicode code points:

You can find the Devanagari code chart (range U+0900 to U+097F) on the Unicode Consortium website.

Note that Unicode code points are conventionally written in hexadecimal. So rather than the decimal number 2350, the code point would be written as U+092E, and it appears as 092E on the code chart.
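If you want to print that notation from code, something along these lines should work. It is a sketch only, and ToUPlus is a name I made up; the "X4" format string pads to at least four uppercase hex digits and uses more when the value needs them.

```csharp
using System;

public static class CodePointNotation
{
    // Hypothetical helper: formats a code point in U+XXXX notation
    // (at least four uppercase hex digits).
    public static string ToUPlus(int codePoint)
    {
        return "U+" + codePoint.ToString("X4");
    }

    public static void Main()
    {
        Console.WriteLine(ToUPlus(2350));    // prints U+092E
        Console.WriteLine(ToUPlus(0x1F600)); // prints U+1F600
    }
}
```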

If you have the string s = "मेरा", then you already have the answer.

This string contains four code points in the BMP which in UTF-16 are represented by 8 bytes. You can access them by index with s[i] , with a foreach loop etc.

If you want the underlying 8 bytes, you can access them like so:

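string str = @"मेरा";
// GetBytes is an instance method, so go through the Encoding.Unicode (UTF-16 little-endian) instance.
byte[] arr = System.Text.Encoding.Unicode.GetBytes(str);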
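As a rough illustration of what those 8 bytes look like, the sketch below dumps them in both byte orders using Encoding.Unicode (UTF-16 little-endian) and Encoding.BigEndianUnicode. The class and method names are mine and purely illustrative.

```csharp
using System;
using System.Text;

public static class Utf16BytesDemo
{
    // Hypothetical helper: UTF-16 little-endian bytes as hex.
    public static string Utf16LeHex(string text)
    {
        return BitConverter.ToString(Encoding.Unicode.GetBytes(text));
    }

    // Hypothetical helper: UTF-16 big-endian bytes as hex.
    public static string Utf16BeHex(string text)
    {
        return BitConverter.ToString(Encoding.BigEndianUnicode.GetBytes(text));
    }

    public static void Main()
    {
        Console.WriteLine(Utf16LeHex("मेरा")); // prints 2E-09-47-09-30-09-3E-09
        Console.WriteLine(Utf16BeHex("मेरा")); // prints 09-2E-09-47-09-30-09-3E
    }
}
```

In the big-endian form the byte pairs read directly as the code points 092E, 0947, 0930, 093E; the little-endian form stores the low byte of each code unit first.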


 