How to convert a Unicode character to its escaped ASCII equivalent in C#

I am starting with a string containing an encoded Unicode character, "&#xfc;". I pass the string to an object that performs some logic and returns another string. In the returned string, the original encoded character has been converted to its Unicode equivalent, "ü".

I need to get the original encoded character back, but so far I have not been able to.

I have tried using the HttpUtility.HtmlEncode() method, but that returns "&#252;", which is not the same.

Can anyone help?

They are pretty much the same, at least for display purposes. HttpUtility.HtmlEncode uses decimal encoding, which has the format &#DECIMAL; while your original version uses hexadecimal encoding, i.e. the format &#xHEX;. Since fc in hex is 252 in decimal, the two are equivalent.
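To check the equivalence rather than take it on faith, here is a minimal sketch that decodes both forms and compares the results. It assumes System.Net.WebUtility is available (.NET Framework 4.0 or later, or .NET Core); the class name ReferenceEquivalence is only illustrative.

```csharp
using System;
using System.Net;

static class ReferenceEquivalence
{
    // Decodes an HTML character reference to the character(s) it names.
    public static string Decode(string reference)
    {
        return WebUtility.HtmlDecode(reference);
    }

    static void Main()
    {
        string fromHex = Decode("&#xfc;");      // hexadecimal form
        string fromDecimal = Decode("&#252;");  // decimal form

        Console.WriteLine(fromHex == fromDecimal);     // prints: True
        Console.WriteLine(Convert.ToInt32("fc", 16));  // prints: 252
    }
}
```

Both references decode to the same one-character string, so a browser renders them identically. The difference only matters if something downstream compares the markup byte for byte.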

If you really need the hex-encoded version, consider parsing out the decimal value and converting it to hex before putting it back into the &#xHEX; format. Something like:

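// Requires: using System.Web;
string unicode = "ü";
string decimalEncoded = HttpUtility.HtmlEncode(unicode);   // "&#252;"
// Strip the leading "&#" and the trailing ";" to get the decimal digits.
// ("decimal" is a C# keyword, so the variable is named codePoint instead.)
int codePoint = int.Parse(decimalEncoded.Substring(2, decimalEncoded.Length - 3));
string hexEncoded = string.Format("&#x{0:X};", codePoint); // "&#xFC;"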
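The snippet above only handles a string that consists of exactly one encoded character. For longer text, the same parse-and-reformat idea can be applied to every decimal reference with a regular expression. This is a rough sketch, not a hardened implementation: the class and method names are made up for illustration, and it assumes the input has already been HTML-encoded so that decimal references appear as &#NNN;.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

static class DecimalToHexReferences
{
    // Matches decimal numeric character references such as "&#252;".
    private static readonly Regex DecimalReference = new Regex(@"&#([0-9]+);");

    // Rewrites every decimal reference in the input as a hex reference.
    // Named entities such as "&amp;" are left untouched.
    public static string ToHex(string html)
    {
        return DecimalReference.Replace(html, match =>
        {
            int codePoint = int.Parse(match.Groups[1].Value, CultureInfo.InvariantCulture);
            return "&#x" + codePoint.ToString("X", CultureInfo.InvariantCulture) + ";";
        });
    }

    static void Main()
    {
        Console.WriteLine(ToHex("M&#252;ller &amp; S&#246;hne"));
        // prints: M&#xFC;ller &amp; S&#xF6;hne
    }
}
```

Use the "x" format specifier instead of "X" if the lowercase form from the question (&#xfc;) is required. An absurdly long digit run would make int.Parse throw, so untrusted input would need a guard.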

Or you can try this code:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Web;
using System.Configuration;
using System.Globalization;

namespace SimpleCGIEXE
{
    class Program
    {
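        // Converts non-ASCII characters in src to hex numeric character references.
        static string Uni2Html(string src)
        {
            // UrlEncodeUnicode emits non-ASCII characters as %uXXXX and escaped ASCII as %XX.
            // Note: this method is marked obsolete in newer .NET Framework versions.
            string temp1 = HttpUtility.UrlEncodeUnicode(src);
            // Spaces are encoded as '+', so turn them back into spaces.
            string temp2 = temp1.Replace('+', ' ');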
            string res = string.Empty;
            int pos1 = 0, pos2 = 0;
            while (true){
                pos2=temp2.IndexOf("%",pos1);
                if (pos2 < 0) break;
                if (temp2[pos2 + 1] == 'u')
                {
                    res += temp2.Substring(pos1, pos2 - pos1);
                    res += "&#x";
                    res += temp2.Substring(pos2 + 2, 4);
                    res += ";";
                    pos1 = pos2 + 6;
                }
                else
                {
                    res += temp2.Substring(pos1, pos2 - pos1);
                    string stASCII = temp2.Substring(pos2 + 1, 2);
                    byte[] pdASCII = new byte[1];
                    pdASCII[0] = byte.Parse(stASCII, System.Globalization.NumberStyles.AllowHexSpecifier);
                    res += Encoding.ASCII.GetString(pdASCII);
                    pos1 = pos2 + 3;
                }
            }
            res += temp2.Substring(pos1);
            return res;
        }
        static void Main(string[] args)
        {
            Console.WriteLine("Content-type: text/html;charset=utf-8\r\n");
            String st = "Vietnamese string: Thử một xâu unicode @@ # ~ .^ % !";
            Console.WriteLine(Uni2Html(st) + "<br>");
            st = "A chinese string: 我爱你 (I love you)";
            Console.WriteLine(Uni2Html(st) + "<br>");
        }
    }
}

I just had to sort this out yesterday.

It's a bit more complicated than just looking at a single character. You need to roll your own HtmlEncode() method. Strings in the .NET world are UTF-16 encoded. A Unicode code point (what an HTML numeric character reference identifies) can be as large as 0x10FFFF, so it does not always fit in a single 16-bit char and is best handled as a 32-bit unsigned integer value. This is mostly an issue if you have to deal with characters outside Unicode's Basic Multilingual Plane.

This code should do what you want:

using System;
using System.Configuration ;
using System.Globalization ;
using System.Collections.Generic ;
using System.Text;


namespace TestDrive
{
    class Program
    {
        static void Main()
        {
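            // Note: \u takes exactly four hex digits, so this literal is U+ABC1 followed by "23".
            string src = "foo \uABC123 bar" ;
            string converted = HtmlEncode(src) ;

            Console.WriteLine( converted ) ;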
        }

        static string HtmlEncode( string s )
        {
            //
            // In the .Net world, strings are UTF-16 encoded. That means that Unicode code points greater than 0xFFFF
            // are encoded in the string as 2-character surrogate pairs. So to properly turn them into HTML numeric
            // character references (decimal or hex), we first need to get the UTF-32 encoding.
            //
            uint[]        utf32Chars = StringToArrayOfUtf32Chars( s ) ;
            StringBuilder sb         = new StringBuilder( 2000 ) ; // set a reasonable initial size for the buffer

            // iterate over the utf-32 encoded characters
            foreach ( uint codePoint in utf32Chars )
            {

                if ( codePoint > 0x0000007F )
                {
                    // if the code point is greater than 0x7F, it gets turned into an HTML numeric character reference
                    sb.AppendFormat( "&#x{0:X};" , codePoint ) ; // hex escape sequence
                  //sb.AppendFormat( "&#{0};"    , codePoint ) ; // decimal escape sequence
                }
                else
                {
                    // if less than or equal to 0x7F, it goes into the string as-is,
                    // except for the 5 SGML/XML/HTML reserved characters. You might
                    // want to also escape all the ASCII control characters (those chars
                    // in the range 0x00 - 0x1F).

                    // convert the uint to a UTF-16 character
                    char ch = Convert.ToChar( codePoint ) ;

                    // do the needful.
                    switch ( ch )
                    {
                    case '"'  : sb.Append( "&quot;"      ) ; break ;
                    case '\'' : sb.Append( "&apos;"      ) ; break ;
                    case '&'  : sb.Append( "&amp;"       ) ; break ;
                    case '<'  : sb.Append( "&lt;"        ) ; break ;
                    case '>'  : sb.Append( "&gt;"        ) ; break ;
                    default   : sb.Append( ch.ToString() ) ; break ;
                    }
                }
            }

            // return the escaped, utf-16 string back to the caller.
            string encoded = sb.ToString() ;
            return encoded ;
        }

        /// <summary>
        /// Convert a UTF-16 encoded .Net string into an array of UTF-32 encoding Unicode chars
        /// </summary>
        /// <param name="s"></param>
        /// <returns></returns>
        private static uint[] StringToArrayOfUtf32Chars( string s )
        {
            Byte[] bytes      = Encoding.UTF32.GetBytes( s ) ;
            uint[] utf32Chars = (uint[]) Array.CreateInstance( typeof(uint) , bytes.Length / sizeof(uint) ) ;

            for ( int i = 0 , j = 0 ; i < bytes.Length ; i += 4 , ++j )
            {
                utf32Chars[ j ] = BitConverter.ToUInt32( bytes , i ) ;
            }

            return utf32Chars ;
        }




    }

}
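For comparison, here is a shorter sketch of the same idea that walks the string directly and uses char.ConvertToUtf32 to combine surrogate pairs, instead of going through a UTF-32 byte array. The class name HexEntityEncoder is hypothetical, and unlike the code above this version does not escape the reserved characters & < > " ', so treat it as an illustration of the surrogate handling only.

```csharp
using System;
using System.Text;

static class HexEntityEncoder
{
    // Replaces every non-ASCII code point with a hex numeric character reference.
    // ASCII characters (including & < > ") are copied unchanged in this sketch.
    public static string Encode(string s)
    {
        var sb = new StringBuilder(s.Length);
        for (int i = 0; i < s.Length; i++)
        {
            if (char.IsHighSurrogate(s[i]) && i + 1 < s.Length && char.IsLowSurrogate(s[i + 1]))
            {
                // A surrogate pair: combine both UTF-16 code units into one code point.
                int codePoint = char.ConvertToUtf32(s[i], s[i + 1]);
                sb.Append("&#x").Append(codePoint.ToString("X")).Append(';');
                i++; // skip the low surrogate that was just consumed
            }
            else if (s[i] > 0x7F)
            {
                // A non-ASCII character inside the Basic Multilingual Plane.
                sb.Append("&#x").Append(((int)s[i]).ToString("X")).Append(';');
            }
            else
            {
                sb.Append(s[i]);
            }
        }
        return sb.ToString();
    }

    static void Main()
    {
        Console.WriteLine(Encode("f\u00FCr \U0001F600"));
        // prints: f&#xFC;r &#x1F600;
    }
}
```

The emoji U+1F600 is stored as two chars in the string but comes out as a single reference, which is the case a per-char loop would get wrong.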

Hope this helps!
