Convert ANSI (Windows 1252) to UTF8 in C#

I've asked this before in a round-about manner here on Stack Overflow, and want to get it right this time. How do I convert ANSI (Codepage 1252) to UTF-8, while preserving the special characters? (I am aware that UTF-8 supports a larger character set than ANSI, but it is okay if I can preserve all UTF-8 characters that are supported by ANSI and substitute the rest with a ? or something.)

Why I Want To Convert ANSI → UTF-8

I am basically writing a program that splits vCard files (VCF) into individual files, each containing a single contact. I've noticed that Nokia and Sony Ericsson phones save the backup VCF file in UTF-8 (without BOM), but Android saves it in ANSI (1252). And God knows what formats the other phones save them in!

So my questions are:

  1. Isn't there an industry standard for vCard files' character encoding?
  2. Which is easier for solving my problem? Converting ANSI to UTF8 (and/or the other way round) or trying to detect which encoding the input file has and notifying the user about it?

tl;dr Need to know how to convert the character encoding from (ANSI / UTF8) to (UTF8 / ANSI) while preserving all special characters.

You shouldn't convert from one encoding to the other. You have to read each file using the encoding that it was created with, or you will lose information.

Once you read the file using the correct encoding, you have the content as a Unicode string; from there you can save it using any encoding you like.

If you need to detect the encoding, you can read the file as bytes and then look for character codes that are specific to either encoding. If the file contains no special characters, either encoding will work, as the characters 32..127 are the same for both encodings.
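One way to sketch that byte-level check, assuming the only candidates are UTF-8 and Windows-1252 (the names `EncodingSniffer` and `GuessEncodingName` are hypothetical, not from any library): files with no bytes at or above 0x80 read the same either way, and anything else is tried against a strict UTF-8 decoder.

```csharp
using System;
using System.Text;

public static class EncodingSniffer
{
    // Hypothetical helper: returns "ascii", "utf-8" or "windows-1252".
    // The last value is a guess that only holds under the assumption above.
    public static string GuessEncodingName(byte[] bytes)
    {
        // Bytes below 0x80 decode identically in both encodings.
        bool hasHighByte = false;
        foreach (byte b in bytes)
        {
            if (b >= 0x80) { hasHighByte = true; break; }
        }
        if (!hasHighByte)
            return "ascii";

        try
        {
            // throwOnInvalidBytes: true makes malformed UTF-8 throw
            // instead of being silently replaced with U+FFFD.
            new UTF8Encoding(false, true).GetString(bytes);
            return "utf-8";
        }
        catch (DecoderFallbackException)
        {
            // Not valid UTF-8, so fall back to the other candidate.
            return "windows-1252";
        }
    }
}
```

Short Windows-1252 text can occasionally happen to be valid UTF-8 as well, so the result is best treated as a hint rather than a certainty.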

VCF is encoded in utf-8 as demanded by the spec in chapter 3.4. You need to take this seriously; the format would be utterly useless if that wasn't cast in stone. If you are seeing some Android app mangling accented characters then work from the assumption that this is a bug in that app. Or more likely, that it got bad info from somewhere else. Your attempt to correct the encoding would then cause more problems because your version of the card will never match the original.

You convert from 1252 to utf-8 with Encoding.GetEncoding(1252).GetString(), passing in a byte[]. Do not ever try to write code that reads a string and whacks it into a byte[] so you can use the conversion method; that just makes the encoding problems a lot worse. In other words, you'd need to read the file with FileStream, not StreamReader. But again, avoid fixing other people's problems.
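A minimal sketch of that byte-based route, assuming a .NET Core / .NET 5+ target and that the input really is Windows-1252 (the names `Cp1252Converter`, `Cp1252BytesToUtf8` and `ConvertFile` are made up for illustration):

```csharp
using System;
using System.IO;
using System.Text;

public static class Cp1252Converter
{
    // Hypothetical helper: re-encodes Windows-1252 bytes as UTF-8 bytes.
    public static byte[] Cp1252BytesToUtf8(byte[] cp1252Bytes)
    {
        // Needed on .NET Core / .NET 5+, where code page encodings are not
        // registered by default. Registering more than once is harmless.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // Decode the raw bytes to a Unicode string first...
        string text = Encoding.GetEncoding(1252).GetString(cp1252Bytes);
        // ...then encode that string as UTF-8 (no BOM).
        return new UTF8Encoding(false).GetBytes(text);
    }

    // Reads the file as raw bytes, never as text, before converting.
    public static void ConvertFile(string inputPath, string outputPath)
    {
        byte[] raw = File.ReadAllBytes(inputPath);
        File.WriteAllBytes(outputPath, Cp1252BytesToUtf8(raw));
    }
}
```

Running this on a file that is already UTF-8 would double-encode its accented characters, which is why the input encoding has to be known first.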

This is what I use in C# (I've been using it to convert from Windows-1252 to UTF8):

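    public static String readFileAsUtf8(string fileName)
    {
        // Note: Encoding.Default is the system ANSI code page on .NET Framework,
        // but it is always UTF-8 on .NET Core / .NET 5+.
        Encoding encoding = Encoding.Default;
        String original = String.Empty;

        using (StreamReader sr = new StreamReader(fileName, Encoding.Default))
        {
            original = sr.ReadToEnd();
            // CurrentEncoding reflects a byte order mark if the reader detected one.
            encoding = sr.CurrentEncoding;
        }

        if (encoding == Encoding.UTF8)
            return original;

        // Round-trip through bytes. 'original' is already a Unicode string,
        // so this should yield the same text.
        byte[] encBytes = encoding.GetBytes(original);
        byte[] utf8Bytes = Encoding.Convert(encoding, Encoding.UTF8, encBytes);
        return Encoding.UTF8.GetString(utf8Bytes);
    }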

I do it this way:

    private static void ConvertAnsiToUTF8(string inputFilePath, string outputFilePath)
    {
        string fileContent = File.ReadAllText(inputFilePath, Encoding.Default);
        File.WriteAllText(outputFilePath, fileContent, Encoding.UTF8);
    }
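Note that `Encoding.UTF8` makes `File.WriteAllText` prepend a byte order mark (EF BB BF). If the output should match the BOM-less UTF-8 files mentioned in the question, a variant along these lines might do (the names `NoBomWriter` and `WriteUtf8NoBom` are hypothetical):

```csharp
using System.IO;
using System.Text;

public static class NoBomWriter
{
    // Hypothetical variant of the method above: writes UTF-8 without a BOM.
    public static void WriteUtf8NoBom(string outputFilePath, string content)
    {
        // UTF8Encoding(false) has an empty preamble, so no BOM is written.
        File.WriteAllText(outputFilePath, content, new UTF8Encoding(false));
    }
}
```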

I found this question while working to process a large collection of ancient text files into well formatted PDFs. None of the files have a BOM, and the oldest of the files contain Codepage 1252 code points that cause incorrect decoding to UTF8. This happens only some of the time; UTF8 works the majority of the time. Also, the latest of the text data DOES contain UTF8 code points, so it's a mixed bag.

So, I also set out "to detect which encoding the input file has" and, after reading "How to detect the character encoding of a text file?" and "How to determine the encoding of text?", arrived at the conclusion that this would be difficult at best.

BUT, I found "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets" in the comments, read it, and found this gem:

UTF-8 has the neat side effect that English text looks exactly the same in UTF-8 as it did in ASCII, so Americans don't even notice anything wrong. Only the rest of the world has to jump through hoops. Specifically, Hello, which was U+0048 U+0065 U+006C U+006C U+006F, will be stored as 48 65 6C 6C 6F, which, behold! is the same as it was stored in ASCII, and ANSI, and every OEM character set on the planet.

The entire article is short and well worth the read.

So, I solved my problem with the following code. Since only a small amount of my text data contains difficult character code points, I don't mind the performance overhead of the exception handling, especially since this only had to run once. Perhaps there are more clever ways of avoiding the try/catch, but I did not bother with devising one.

    public static string ReadAllTextFromFile(string file)
    {
        const int WindowsCodepage1252 = 1252;

        string text;

        try
        {
            var utf8Encoding = Encoding.GetEncoding("UTF-8", EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback); 
            text = File.ReadAllText(file, utf8Encoding);
        }
        catch (DecoderFallbackException dfe)//then text is not entirely valid UTF8, contains Codepage 1252 characters that can't be correctly decoded to UTF8
        {
            var codepage1252Encoding = Encoding.GetEncoding(WindowsCodepage1252, EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
            text = File.ReadAllText(file, codepage1252Encoding);
        }

        return text;
    }

It's also worth noting that the StreamReader class has constructors that take a specific Encoding object, and as I have shown you can adjust the EncoderFallback/DecoderFallback behavior to suit your needs. So if you need a StreamReader or StreamWriter for finer grained work, this approach can still be used.

I use this to convert file encoding to UTF-8:
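As a rough illustration of the StreamReader variant, here is a sketch that reads a file line by line with a strict UTF-8 decoder (the names `StrictReader` and `ReadLinesStrictUtf8` are hypothetical). A malformed byte sequence raises DecoderFallbackException while reading, which the caller could catch in order to retry with Codepage 1252:

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class StrictReader
{
    // Hypothetical helper: reads all lines as strict UTF-8.
    public static List<string> ReadLinesStrictUtf8(string path)
    {
        // Second argument (throwOnInvalidBytes: true) selects the exception
        // fallback instead of silent replacement with U+FFFD.
        var strictUtf8 = new UTF8Encoding(false, true);
        var lines = new List<string>();

        using (var reader = new StreamReader(path, strictUtf8))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                lines.Add(line);
            }
        }
        return lines;
    }
}
```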

    public static void ConvertFileEncoding(String sourcePath, String destPath)
    {
        // If the destination's parent doesn't exist, create it.
        String parent = Path.GetDirectoryName(Path.GetFullPath(destPath));
        if (!Directory.Exists(parent))
        {
            Directory.CreateDirectory(parent);
        }

        // Convert the file.
        String tempName = null;
        try
        {
            tempName = Path.GetTempFileName();
            // StreamReader(path) decodes as UTF-8 unless it detects a byte order mark.
            using (StreamReader sr = new StreamReader(sourcePath))
            {
                using (StreamWriter sw = new StreamWriter(tempName, false, Encoding.UTF8))
                {
                    int charsRead;
                    char[] buffer = new char[128 * 1024];
                    while ((charsRead = sr.ReadBlock(buffer, 0, buffer.Length)) > 0)
                    {
                        sw.Write(buffer, 0, charsRead);
                    }
                }
            }
            File.Delete(destPath);
            File.Move(tempName, destPath);
        }
        finally
        {
            // File.Delete throws on a null path, so guard against GetTempFileName failing.
            if (tempName != null)
            {
                File.Delete(tempName);
            }
        }
    }

How I solved this: I have a vCard file (*.vcf) with 200 contacts in one file, in Russian. I opened it with the vCardOrganizer 2.1 program and used Split to divide it into 200 files, and what I saw was contacts with garbled symbols; the only thing I could read was the numbers :-)

Steps (be patient when you do these steps; sometimes it takes time):

  1. Open the vCard file (my file size was 3 MB) with Notepad.
  2. From the menu, choose File, then Save As.
  3. In the window that opens, choose a file name (don't forget to add .vcf) and an encoding, ANSI or UTF-8.
  4. Click Save.

I converted filename.vcf (UTF-8) to filename.vcf (ANSI): nothing was lost and the Russian text is perfectly readable. If you have questions, write to: yoshidakatana@gmail.com

Good luck!
