Correcting Encoding in a large Xml File

I'm importing data from XML files containing this type of content:

<FirstName>™MšR</FirstName><MiddleName/><LastName>HšNER™Z</LastName>

The XML is loaded via:

 XmlDocument doc = new XmlDocument();

 try
 {
       doc.Load(fullFilePath);
 }
 catch (XmlException)
 {
       // The illegal-character exception is thrown here.
 }

When I run this code on the data shown above, I get an exception about an illegal character. I understand that part just fine.

I'm not sure which encoding this is or how to solve this problem. Is there a way to change the encoding of the XmlDocument, or another method, to make sure the above content is parsed correctly?


Update: I do not have any encoding declaration or <?xml in this document.

I've seen some links that say to add it dynamically. Is this UTF-16 encoding?

It appears that:

  • The name was ÖMÜR HÜNERÖZ (or possibly ÔMÜR HÜNERÔZ or ÕMÜR HÜNERÕZ; I don't know what language that is).
  • The XML file was encoded using the DOS "OEM" code page, probably 437 or 850.
  • But it was decoded using windows-1252 (the "ANSI" code page).
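To check that reasoning, here is a minimal sketch that reverses the mistake: it turns the garbled text back into bytes with the code page that was wrongly used, then decodes those bytes with the code page that was probably intended. The class and method names (MojibakeDemo, Redecode) are made up for illustration, and the sketch assumes a .NET Core 3.0+ / .NET 5+ target where CodePagesEncodingProvider is available.

```csharp
using System;
using System.Text;

public static class MojibakeDemo
{
    // Hypothetical helper, not part of any library: undo a "decoded with the
    // wrong code page" mistake.
    public static string Redecode(string garbled, int wrongCodePage, int intendedCodePage)
    {
        // Assumption: running on .NET Core / .NET 5+, where code pages such as
        // 437 and 1252 are only available after registering this provider.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // Recover the original bytes by encoding with the wrongly used code page...
        byte[] raw = Encoding.GetEncoding(wrongCodePage).GetBytes(garbled);

        // ...then decode them with the code page the file was written in.
        return Encoding.GetEncoding(intendedCodePage).GetString(raw);
    }
}
```

With this, MojibakeDemo.Redecode("™MšR", 1252, 437) gives back "ÖMÜR". Code page 850 maps these two bytes to the same letters, so this sample alone cannot tell 437 and 850 apart.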

If you look at the file with a hex editor (HxD or Visual Studio, for instance), what exactly do you see?

Is every character from the string you posted represented by a single byte? Does the file have a byte-order mark (a bunch of non-printable bytes at the start of the file)?

The ™ and š seem to indicate that something went pretty wrong with encoding/conversion along the way, but let's see... I guess they both correspond to a vowel (O-M-A-R H-A-NER-O-Z, maybe?), but I haven't figured out yet how they ended up looking like this...

Edit: dan04 hit the nail on the head. ™ in cp-1252 has hex value 99, and š is 9a. In cp-437 and cp-850, hex 99 represents Ö, and 9a represents Ü.

The fix is simple: just specify this encoding when opening your XML file:

XmlDocument doc = new XmlDocument();

using (var reader = new StreamReader(fileName, Encoding.GetEncoding(437)))
{
   doc.Load(reader);
}
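One caveat if you are on .NET Core or .NET 5+: as far as I know, Encoding.GetEncoding(437) fails there unless the code-pages provider has been registered first. Below is a sketch of the same fix wrapped in a helper; OemXmlLoader and its default code page of 437 are my own illustrative choices, not something from the question.

```csharp
using System.IO;
using System.Text;
using System.Xml;

public static class OemXmlLoader
{
    // Hypothetical helper: load an XML file that was written in a DOS "OEM"
    // code page. The default of 437 is an assumption; 850 may fit your data.
    public static XmlDocument Load(string path, int codePage = 437)
    {
        // Assumption: .NET Core / .NET 5+ target, where legacy code pages
        // must be registered before GetEncoding can return them.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        var doc = new XmlDocument();

        // The StreamReader decodes the bytes, so the XML parser only ever
        // sees correctly decoded characters.
        using (var reader = new StreamReader(path, Encoding.GetEncoding(codePage)))
        {
            doc.Load(reader);
        }

        return doc;
    }
}
```

Usage would be something like var doc = OemXmlLoader.Load(fullFilePath); in place of the original doc.Load(fullFilePath) call.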

From here:

Encoding encoding;
using (var stream = new MemoryStream(bytes))
{
    using (var xmlreader = new XmlTextReader(stream))
    {
        xmlreader.MoveToContent();
        encoding = xmlreader.Encoding;
    }
}

You might want to take a look at this: How to best detect encoding in XML file?

For the actual reading, you can use StreamReader to take care of the BOM (byte order mark):

string xml;

using (var reader = new StreamReader("FilePath", true))
{                                   //            ↑ 
    xml= reader.ReadToEnd();       //        detectEncodingFromByteOrderMarks
}

Edit: Removed the encoding parameter. StreamReader will detect the encoding of a file if the file contains a BOM. If it does not, it will default to UTF-8.
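If you want to see what StreamReader actually picked, a rough sketch along these lines should work. EncodingProbe and DetectFromBom are names I made up; the important detail is that CurrentEncoding is only meaningful after the first read, because detection happens lazily.

```csharp
using System.IO;
using System.Text;

public static class EncodingProbe
{
    // Hypothetical helper: report which encoding StreamReader settles on
    // when it is asked to detect the encoding from a byte order mark.
    public static Encoding DetectFromBom(string path)
    {
        using (var reader = new StreamReader(path, detectEncodingFromByteOrderMarks: true))
        {
            // Detection is not done until the first read, so peek once
            // before asking for CurrentEncoding.
            reader.Peek();
            return reader.CurrentEncoding;
        }
    }
}
```

Keep in mind that this only looks at the BOM. A file like the one in the question, written in an OEM code page without a BOM, would still be reported as UTF-8 and decoded incorrectly, so BOM detection alone would not fix it.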

Edit 2: Detecting Text Encoding for StreamReader

Obviously you provided a fragment of the XML document since it's missing a root element, so I'll assume that was your intention. Is there an XML declaration at the top, like <?xml version="1.0" encoding="UTF-8" ?>?
