How to encode and decode Broken Chinese/Unicode characters?

I've tried googling but couldn't work out which charset the text below belongs to:

具有éœé›»ç”¢ç”Ÿè£ç½®ä¹‹å½±åƒè¼¸å…¥è£ç½®

But after adding <meta http-equiv="Content-Type" Content="text/html; charset=utf-8"> and saving that string in an HTML file, I was able to view the Chinese characters properly:

具有靜電產生裝置之影像輸入裝置

So my question is:

  1. What tools can I use to detect the character set of this text?

  2. And how do I convert/encode/decode it properly in C#?

Update: For completeness' sake, I've added this test.

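using System.Text;
using Microsoft.VisualStudio.TestTools.UnitTesting;

[TestClass]
public class EncodingTests // class name is illustrative; the original post omitted the declaration
{
    [TestMethod]
    public void TestMethod1()
    {
        // Note: some UTF-8 bytes (0x9D, for example) have no printable Windows-1252
        // character, so a literal pasted from a web page may have lost them.
        string encodedText = "具有éœé›»ç”¢ç”Ÿè£ç½®ä¹‹å½±åƒè¼¸å…¥è£ç½®";
        Encoding utf8 = new UTF8Encoding();
        Encoding windows1252 = Encoding.GetEncoding("Windows-1252");

        // Re-encode the mis-decoded text to recover the raw bytes...
        byte[] postBytes = windows1252.GetBytes(encodedText);

        // ...then decode those bytes as UTF-8.
        string decodedText = utf8.GetString(postBytes);
        string actualText = "具有靜電產生裝置之影像輸入裝置";
        Assert.AreEqual(actualText, decodedText);
    }
}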

Thanks.

When you save the "bad" string in a text file with a meta tag declaring UTF-8, your text editor saves the file in Windows-1252 encoding, but the browser reads the file and interprets it as UTF-8. Since the "bad" string is UTF-8 bytes that were incorrectly decoded as Windows-1252, encoding the file as Windows-1252 and decoding it as UTF-8 reverses the process.

Here's an example:

using System.Text;
using System.Windows.Forms;

namespace Demo
{
    class Program
    {
        static void Main(string[] args)
        {
            string s = "具有靜電產生裝置之影像輸入裝置"; // Unicode
            Encoding Windows1252 = Encoding.GetEncoding("Windows-1252");
            Encoding Utf8 = Encoding.UTF8;
            byte[] utf8Bytes = Utf8.GetBytes(s); // Unicode -> UTF-8
            string badDecode = Windows1252.GetString(utf8Bytes); // Mis-decode as Windows-1252
            MessageBox.Show(badDecode,"Mis-decoded");  // Shows your garbage string.
            string goodDecode = Utf8.GetString(utf8Bytes); // Correctly decode as UTF-8
            MessageBox.Show(goodDecode, "Correctly decoded");

            // Recovering from bad decode...
            byte[] originalBytes = Windows1252.GetBytes(badDecode);
            goodDecode = Utf8.GetString(originalBytes);
            MessageBox.Show(goodDecode, "Re-decoded");
        }
    }
}

Even with correct decoding, you'll still need a font that supports the characters being displayed. If your default font doesn't support Chinese, you still might not see the correct characters.

The correct thing to do is figure out why the string you have was decoded as Windows-1252 in the first place. Sometimes, though, data in a database is stored incorrectly to begin with, and you have to resort to tricks like this to fix the problem.
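When this kind of repair has to run over stored data, it helps to make it fail loudly instead of silently producing more garbage. The following is a minimal sketch, not a definitive implementation: the class and method names are made up for illustration, and it assumes .NET Core or .NET 5+, where Windows-1252 is only available after registering CodePagesEncodingProvider.

```csharp
using System;
using System.Text;

static class MojibakeRepair
{
    // Returns true and sets 'repaired' only when 'suspect' survives a strict
    // Windows-1252 encode followed by a strict UTF-8 decode.
    public static bool TryRepair(string suspect, out string repaired)
    {
        // Assumption: .NET Core / .NET 5+, where code page 1252 must be registered.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // Exception fallbacks make both steps throw instead of substituting '?'.
        Encoding strict1252 = Encoding.GetEncoding(
            "Windows-1252",
            EncoderFallback.ExceptionFallback,
            DecoderFallback.ExceptionFallback);
        Encoding strictUtf8 = new UTF8Encoding(false, true);

        try
        {
            byte[] originalBytes = strict1252.GetBytes(suspect);
            repaired = strictUtf8.GetString(originalBytes);
            return true;
        }
        catch (EncoderFallbackException)
        {
            // A character has no Windows-1252 byte, so the text was not mis-decoded this way.
        }
        catch (DecoderFallbackException)
        {
            // The recovered bytes are not valid UTF-8.
        }

        repaired = suspect;
        return false;
    }
}
```

Text that is already correct Chinese fails the Windows-1252 step and is returned unchanged. Plain ASCII passes both steps and comes back identical, so the check cannot tell "repaired" from "never broken" in that case.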

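string test = "敭畳灴獩楫n"; // Incoming data; it should read "mesutpiskin".

// The text is single-byte characters that were misread as UTF-16LE,
// so re-encoding as UTF-16LE recovers the original bytes.
byte[] bytes = Encoding.Unicode.GetBytes(test);

// Widen each byte back to the char with the same code point.
var sb = new StringBuilder(bytes.Length);
foreach (byte b in bytes)
{
    sb.Append((char)b);
}

// An odd-length original leaves a padding NUL character; trim it.
string s = sb.ToString().Trim('\0');

MessageBox.Show(s); // s == "mesutpiskin"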

I'm not really sure what you mean, but I'm guessing you want to convert between a byte array holding text in a certain encoding and a string. Let's assume the character encoding is called "FooBar":

This is how you encode and decode:

Encoding myEncoding = Encoding.GetEncoding("FooBar");
string myString = "lala";
byte[] myEncodedBytes = myEncoding.GetBytes(myString);
string myDecodedString = myEncoding.GetString(myEncodedBytes);

You can learn more about the Encoding class over at MSDN.
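Since "FooBar" is only a placeholder, here is a minimal sketch of the same round trip that takes a real encoding name such as "utf-8". The class and method names are hypothetical, chosen just for this illustration.

```csharp
using System;
using System.Text;

static class EncodingRoundTrip
{
    // Encodes 'text' to bytes with the named encoding, then decodes it back.
    public static string RoundTrip(string text, string encodingName)
    {
        // GetEncoding throws ArgumentException for names it does not know, such as "FooBar".
        Encoding encoding = Encoding.GetEncoding(encodingName);
        byte[] encodedBytes = encoding.GetBytes(text);
        return encoding.GetString(encodedBytes);
    }
}
```

The round trip only preserves the text if the encoding can represent every character. `RoundTrip("具有", "utf-8")` returns the input unchanged, while `RoundTrip("具有", "us-ascii")` returns "??" because ASCII replaces characters it cannot encode.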

Answering the questions at the end of your post:

  1. If you want to determine the text encoding at runtime, take a look at: http://code.google.com/p/ude/

  2. For converting between character sets you can use http://msdn.microsoft.com/en-us/library/system.text.encoding.convert(v=vs.100).aspx
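As a rough illustration of the second point, Encoding.Convert re-encodes a byte array from one encoding to another in a single call. The wrapper class and method name below are made up for this sketch.

```csharp
using System.Text;

static class ConvertDemo
{
    // Transcodes UTF-8 bytes to UTF-16LE bytes (Encoding.Unicode in .NET).
    public static byte[] Utf8ToUtf16(byte[] utf8Bytes)
    {
        return Encoding.Convert(Encoding.UTF8, Encoding.Unicode, utf8Bytes);
    }
}
```

For example, `ConvertDemo.Utf8ToUtf16(Encoding.UTF8.GetBytes("靜"))` turns the three UTF-8 bytes into two UTF-16 bytes. Note that Encoding.Convert works on bytes; if the text has already been mis-decoded into a string, it has to be encoded back to the original bytes first.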

It's Windows Latin 1. I pasted the Chinese text as UTF-8 into BBEdit (a text editor for Mac), reopened the file as Windows Latin 1 and bang, exactly those accented characters appeared.
