Correcting encoding in a large XML file

Question

I'm importing data from XML files containing this type of content:

<FirstName>™MšR</FirstName><MiddleName/><LastName>HšNER™Z</LastName>

The XML is loaded via:

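 XmlDocument doc = new XmlDocument();

 try
 {
       doc.Load(fullFilePath);
 }
 catch (XmlException)
 {
       // The "illegal character" exception surfaces here; handling is omitted in the post.
 }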

When I run this code on the data shown above, I get an exception about an illegal character. I understand that part just fine.

I'm not sure which encoding this is or how to solve this problem. Is there a way to change the encoding of the XmlDocument, or another method to make sure the above content is parsed correctly?

Update: This document has no encoding declaration and no <?xml prolog.

I've seen some links that suggest adding it dynamically. Is this UTF-16 encoding?

Answer 1

It appears that:

The name was ÖMÜR HÜNERÖZ (or possibly ÔMÜR HÜNERÔZ or ÕMÜR HÜNERÕZ; I don't know what language that is).
The XML file was encoded using the DOS "OEM" code page, probably 437 or 850.
But it was decoded using windows-1252 (the "ANSI" code page).
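
The diagnosis above can be checked by running the mistake in reverse: encode the garbled text back to bytes with windows-1252, then decode those bytes as code page 437. The following is only a sketch. The class and method names (MojibakeDemo, Repair) are made up for illustration, and it assumes a modern .NET runtime (.NET Core 3.0 or later), where CodePagesEncodingProvider has to be registered before code pages 437 and 1252 can be used.

```csharp
using System;
using System.Text;

public static class MojibakeDemo
{
    // Hypothetical helper: undoes a "decoded with the wrong code page" mistake by
    // re-encoding the garbled text with the code page that was wrongly used, then
    // decoding the resulting bytes with the code page that should have been used.
    public static string Repair(string garbled, int wrongCodePage, int rightCodePage)
    {
        // Assumption: running on .NET Core / .NET 5+, where legacy code pages
        // are only available after registering this provider.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        byte[] raw = Encoding.GetEncoding(wrongCodePage).GetBytes(garbled);
        return Encoding.GetEncoding(rightCodePage).GetString(raw);
    }

    public static void Main()
    {
        Console.OutputEncoding = Encoding.UTF8;

        // "\u2122" is the trademark sign and "\u0161" is s-with-caron,
        // i.e. the two odd characters from the question.
        Console.WriteLine(Repair("\u2122M\u0161R", 1252, 437));      // ÖMÜR
        Console.WriteLine(Repair("H\u0161NER\u2122Z", 1252, 437));   // HÜNERÖZ
    }
}
```

Passing 850 instead of 437 gives the same result for these two names, because both code pages map 0x99 to Ö and 0x9A to Ü. This sample alone therefore cannot tell the two apart.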

Answer 2

If you look at the file with a hex editor (HxD or Visual Studio, for instance), what exactly do you see?

Is every character from the string you posted represented by a single byte? Does the file have a byte-order mark (a bunch of non-printable bytes at the start of the file)?
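
If no hex editor is at hand, the same two questions can be answered from code. The sketch below reads only the first 64 bytes, so it stays cheap on a large file. It reports whether they start with a known byte-order mark and prints them in hex. BomCheck and DescribeBom are hypothetical names, and the file path is expected as the first command-line argument.

```csharp
using System;
using System.IO;
using System.Linq;

public static class BomCheck
{
    // Returns a label for the byte-order mark at the start of the buffer, or "none".
    public static string DescribeBom(byte[] head)
    {
        if (StartsWith(head, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        // UTF-32 LE must be tested before UTF-16 LE: both begin with FF FE.
        if (StartsWith(head, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32 LE";
        if (StartsWith(head, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32 BE";
        if (StartsWith(head, 0xFF, 0xFE)) return "UTF-16 LE";
        if (StartsWith(head, 0xFE, 0xFF)) return "UTF-16 BE";
        return "none";
    }

    private static bool StartsWith(byte[] data, params byte[] prefix) =>
        data.Length >= prefix.Length && data.Take(prefix.Length).SequenceEqual(prefix);

    public static void Main(string[] args)
    {
        // args[0]: path to the XML file (placeholder, supplied by the caller).
        byte[] head = new byte[64];
        int read;
        using (var stream = File.OpenRead(args[0]))
        {
            read = stream.Read(head, 0, head.Length);
        }
        head = head.Take(read).ToArray();

        Console.WriteLine("BOM: " + DescribeBom(head));
        // Hex dump of the first bytes, separated by dashes.
        Console.WriteLine(BitConverter.ToString(head));
    }
}
```

In the dump, a name such as the one in the question would show one byte per character (for example 99 for the first letter) if the file really is in a single-byte code page.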

The ™ and š seem to indicate that something went pretty wrong with encoding/conversion along the way, but let's see... I guess they both correspond to a vowel (O-M-A-R H-A-NER-O-Z, maybe?), but I haven't figured out yet how they ended up looking like this...

Edit: dan04 hit the nail on the head. In cp-1252, ™ has hex value 99 and š is 9A. In cp-437 and cp-850, hex 99 represents Ö and 9A represents Ü.

The fix is simple: just specify this encoding when opening your XML file:

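XmlDocument doc = new XmlDocument();

// Code page 437 is the DOS "OEM" code page. On .NET Core / .NET 5+, call
// Encoding.RegisterProvider(CodePagesEncodingProvider.Instance) once beforehand.
using (var reader = new StreamReader(fileName, Encoding.GetEncoding(437)))
{
   // Load(TextReader) parses the text as decoded by the reader.
   doc.Load(reader);
}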

Answer 3

From here:

Encoding encoding;
using (var stream = new MemoryStream(bytes))
{
    using (var xmlreader = new XmlTextReader(stream))
    {
        xmlreader.MoveToContent();
        encoding = xmlreader.Encoding;
    }
}

You might want to take a look at this: How to best detect encoding in XML file?

For the actual reading, you can use StreamReader to take care of the BOM (byte-order mark):

string xml;

// The second argument is detectEncodingFromByteOrderMarks.
using (var reader = new StreamReader("FilePath", true))
{
    xml = reader.ReadToEnd();
}

Edit: Removed the encoding parameter. StreamReader will detect the encoding of a file if the file contains a BOM. If it does not, it will default to UTF-8.

Edit 2: Detecting Text Encoding for StreamReader

Answer 4

Obviously you provided a fragment of the XML document, since it's missing a root element, so I'll assume that was your intention. Is there an XML processing instruction at the top, like <?xml version="1.0" encoding="UTF-8" ?>?
