Manipulating both Unicode and ASCII character sets in C#

Question

I have this mapping in my C# application

string [,] unicode2Ascii = { { "ஹ", "\\x86" } };

ஹ is the Unicode value for the Tamil letter "ஹ". This is the raw hex literal for the Unicode value saved by MS Word as a byte sequence. I am trying to map these Unicode value "strings" to hex values under 255 (to accommodate systems that do not support Unicode).

I am trying to use string.Replace like this:

S = S.Replace(unicode2Ascii[0,0], unicode2Ascii[0,1]);

However, the resulting output has a ? instead of the actual hex 0x86 stored. Any pointers on how I could set the encoding for the second element of that array to something like windows-1252?

Or is there a better way to do this conversion?

thanks in advance

Answer 1

Not sure if this helps, but the Tamil codepage "57004 - ISCII Tamil" is supported by Windows.

It does not give the same translation for the example character above though. For 'ஹ' it gives 216. Perhaps a different codepage needs to be used?

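        // Requires: using System.Text;
        string tamilUnicodeString = "ஹ";

        // "x-iscii-ta" is code page 57004 (ISCII Tamil). On .NET Core / .NET 5+ this
        // needs Encoding.RegisterProvider(CodePagesEncodingProvider.Instance) first.
        Encoding encoding = Encoding.GetEncoding("x-iscii-ta");

        byte[] codepageBytes = encoding.GetBytes(tamilUnicodeString);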

Update

If you wish to take a Unicode file as input and transliterate its characters into a single-byte representation, the following should do the trick. The resulting array will hold your single-byte representation, provided your dictionary has an entry for every character in the input:

        Dictionary<char, char> lookup = new Dictionary<char, char>
        {
            { 'ஹ', '\x86' },
            { 'இ',  '\x87' },
            //next pair...,
            //etc, etc.
        };

        string input = "ஹஇதில் உள்ள தமிழ் எழுத்துக்கள் சரியாகத் தெரிந்தால்";

        char[] chars = input.ToCharArray();

        for (int i = 0; i < chars.Length; i++)
        {
            char replaceChar;

            if (lookup.TryGetValue(chars[i], out replaceChar))
            {
                chars[i] = replaceChar;
            }
        }

        byte[] output = Encoding.GetEncoding("iso-8859-1").GetBytes(chars);
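To illustrate why the last line above uses iso-8859-1: that encoding maps the characters U+0000 to U+00FF one-to-one onto the bytes 0x00 to 0xFF, so a replacement char such as '\x86' comes out as the byte 0x86. An encoding that has no slot for the character substitutes '?' instead, which may be what produced the ? in the question (that part is an assumption, since the question does not show how the string was written out). A minimal sketch, written as a C# 9+ top-level program; the variable names are only illustrative:

```csharp
using System;
using System.Text;

// '\u0086' is the same char that the literal '\x86' denotes in C#.
char mapped = '\u0086';

// ISO-8859-1 maps U+0000..U+00FF directly onto bytes 0x00..0xFF.
byte[] latin1Bytes = Encoding.GetEncoding("iso-8859-1").GetBytes(new[] { mapped });

// ASCII only covers U+0000..U+007F, so the encoder falls back to '?' (0x3F).
byte[] asciiBytes = Encoding.ASCII.GetBytes(new[] { mapped });

Console.WriteLine("iso-8859-1: 0x{0:X2}", latin1Bytes[0]); // prints "iso-8859-1: 0x86"
Console.WriteLine("ascii:      0x{0:X2}", asciiBytes[0]);  // prints "ascii:      0x3F"
```

The same check can be run against whatever target encoding you end up using, to see whether your mapped values survive the conversion to bytes.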

Answer 2

Strings in .NET are always Unicode internally. However, this does not really matter. Strings are a series of characters, and .NET strings support all Unicode characters. You should not care how they are represented in memory. You care about encoding only when your strings leave (or enter) .NET, i.e. when you write them to (or read them from) files, send (or receive) them over sockets to other systems, etc. This is when you use the Encoding class to convert to whatever encoding you desire. Replacing characters or trying any encoding tricks on .NET strings is pointless. Also, I recommend this article: http://www.joelonsoftware.com/articles/Unicode.html
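As a rough sketch of that point, the snippet below keeps the text as an ordinary .NET string and picks an encoding only when the string is turned into bytes or written to a file. It is written as a C# 9+ top-level program; the temp file name is hypothetical and chosen only for the example.

```csharp
using System;
using System.IO;
using System.Text;

string text = "ஹ"; // U+0BB9, a single UTF-16 code unit inside .NET

// The encoding is chosen only at the point where the string becomes bytes.
byte[] utf8 = Encoding.UTF8.GetBytes(text);     // E0 AE B9
byte[] utf16 = Encoding.Unicode.GetBytes(text); // B9 0B (little-endian UTF-16)

// The same applies when the string leaves .NET through a file.
string path = Path.Combine(Path.GetTempPath(), "tamil-sample.txt"); // hypothetical file name
File.WriteAllText(path, text, Encoding.UTF8);
string roundTripped = File.ReadAllText(path, Encoding.UTF8);

Console.WriteLine(utf8.Length);          // prints 3
Console.WriteLine(utf16.Length);         // prints 2
Console.WriteLine(roundTripped == text); // prints True
```

The string itself never changes; only the byte sequence produced at the boundary depends on the encoding you pass in.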

2 answers

solution1
4 ACCEPTED 2011-01-05 08:49:13

solution2
3 2011-01-05 08:30:59
