Correcting Encoding in a Large XML File
I'm importing data from XML files containing this type of content:
<FirstName>™MšR</FirstName><MiddleName/><LastName>HšNER™Z</LastName>
The XML is loaded via:
XmlDocument doc = new XmlDocument();
try
{
doc.Load(fullFilePath);
}
When I execute this code with the data shown above, I get an exception about an illegal character. I understand that part just fine.
I'm not sure which encoding this is or how to solve this problem. Is there a way I can change the encoding of the XmlDocument, or another method to make sure the above content is parsed correctly?
Update: I do not have any encoding declaration or <?xml in this document. I've seen some links that say to add it dynamically. Is this UTF-16 encoding?
It appears that the intended text is: ÖMÜR HÜNERÖZ (or possibly ÔMÜR HÜNERÔZ or ÕMÜR HÜNERÕZ; I don't know what language that is).
If you look at the file with a hex editor (HxD or Visual Studio, for instance), what exactly do you see?
Is every character from the string you posted represented by a single byte? Does the file have a byte-order mark (a bunch of non-printable bytes at the start of the file)?
The ™ and š seem to indicate that something went pretty wrong with encoding/conversion along the way, but let's see... I guess they both correspond to a vowel (O-M-A-R H-A-NER-O-Z, maybe?), but I haven't figured out yet how they ended up looking like this...
Edit: dan04 hit the nail on the head. ™ in cp-1252 has hex value 99, and š is 9a. In cp-437 and cp-850, hex 99 represents Ö, and 9a represents Ü.
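The byte mapping described above can be checked with a short round trip. This is only a sketch: the class and method names (MojibakeRepair, Repair) are made up for illustration, and the CodePagesEncodingProvider registration assumes .NET Core or .NET 5+ (on .NET Framework, code pages 437 and 1252 are available without it and that line can be dropped).

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration; not part of any library.
static class MojibakeRepair
{
    static MojibakeRepair()
    {
        // Assumption: running on .NET Core / .NET 5+, where legacy code pages
        // must be registered before Encoding.GetEncoding(437) will work.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    }

    // Takes text that was wrongly decoded as cp-1252 and re-decodes it as cp-437.
    public static string Repair(string garbled)
    {
        Encoding cp1252 = Encoding.GetEncoding(1252);
        Encoding cp437 = Encoding.GetEncoding(437);

        // Recover the raw bytes as they sit in the file (0x99 for ™, 0x9A for š).
        byte[] raw = cp1252.GetBytes(garbled);

        // Decode those same bytes with the code page they were written in.
        return cp437.GetString(raw);
    }
}

class Program
{
    static void Main()
    {
        // "™MšR HšNER™Z" written with escapes so the source file's own encoding does not matter.
        string garbled = "\u2122M\u0161R H\u0161NER\u2122Z";
        string repaired = MojibakeRepair.Repair(garbled);

        // repaired holds "ÖMÜR HÜNERÖZ"; how the console renders it depends on the console's encoding.
        Console.WriteLine(repaired);
    }
}
```

A round trip like this is only useful for confirming the diagnosis on a sample string. For the import itself, reading the file with the right code page from the start avoids the wrong decoding altogether.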
The fix is simple: just specify this encoding when opening your XML file:
XmlDocument doc = new XmlDocument();
using (var reader = new StreamReader(fileName, Encoding.GetEncoding(437)))
{
doc.Load(reader);
}
// bytes: the raw contents of the XML file as a byte[] (not shown here)
Encoding encoding;
using (var stream = new MemoryStream(bytes))
{
using (var xmlreader = new XmlTextReader(stream))
{
xmlreader.MoveToContent();
encoding = xmlreader.Encoding;
}
}
You might want to take a look at this: How to best detect encoding in XML file?
For the actual reading, you can use StreamReader to take care of the BOM (byte order mark):
string xml;
// The second argument is detectEncodingFromByteOrderMarks.
using (var reader = new StreamReader("FilePath", true))
{
    xml = reader.ReadToEnd();
}
Edit: Removed the encoding parameter. StreamReader will detect the encoding of a file if the file contains a BOM. If it does not, it will default to UTF-8.
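To see which encoding StreamReader actually settled on, a sketch along these lines may help. The class and method names (BomCheck, DetectFromBom) are hypothetical, and the file path is taken from the command line purely for illustration. Note that a file with no BOM will report UTF-8 here even if its bytes are really in a legacy code page such as cp-437, so this check cannot identify that case.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper for illustration; not part of any library.
static class BomCheck
{
    // Returns the name of the encoding StreamReader chose for the file.
    public static string DetectFromBom(string path)
    {
        using (var reader = new StreamReader(path, detectEncodingFromByteOrderMarks: true))
        {
            // Detection only happens on the first read, so read one character
            // before asking for CurrentEncoding.
            reader.Read();
            return reader.CurrentEncoding.WebName;
        }
    }
}

class Program
{
    static void Main(string[] args)
    {
        if (args.Length == 0)
        {
            Console.WriteLine("usage: pass the path of the XML file to inspect");
            return;
        }
        Console.WriteLine(BomCheck.DetectFromBom(args[0]));
    }
}
```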
Edit 2: Detecting Text Encoding for StreamReader
Obviously you provided a fragment of the XML document since it's missing a root element, so I'll assume that was your intention. Is there an XML processing instruction at the top, like
<?xml version="1.0" encoding="UTF-8" ?>
?