C# Detect XML encoding from byte array?

Well, I have a byte array, and I know it contains an XML-serialized object. Is there any way to get the encoding from it?

I'm not going to deserialize it, but I'm saving it in an XML field on a SQL Server, so I need to convert it to a string?

A solution similar to this question could solve this by using a Stream over the byte array. Then you won't have to fiddle at the byte level. Like this:

Encoding encoding;
using (var stream = new MemoryStream(bytes))
{
    using (var xmlreader = new XmlTextReader(stream))
    {
        xmlreader.MoveToContent();
        encoding = xmlreader.Encoding;
    }
}
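Since the goal is a string, the detected encoding can then be used to decode the same bytes. The following is only a sketch under assumptions: the class name XmlBytes and the method ToXmlString are made up for illustration, and the fallback to UTF-8 when the reader reports no encoding is my choice, not something the reader guarantees.

```csharp
using System.IO;
using System.Text;
using System.Xml;

public static class XmlBytes
{
    // Hypothetical helper: detect the encoding with XmlTextReader,
    // then decode the whole array with that encoding.
    public static string ToXmlString(byte[] bytes)
    {
        Encoding encoding;
        using (var stream = new MemoryStream(bytes))
        using (var xmlreader = new XmlTextReader(stream))
        {
            // Reading up to the first content node makes the reader
            // process any byte order mark and XML declaration.
            xmlreader.MoveToContent();
            // Assumption: fall back to UTF-8 if no encoding is reported.
            encoding = xmlreader.Encoding ?? Encoding.UTF8;
        }

        // StreamReader skips a leading byte order mark, so the result
        // starts at '<' rather than at U+FEFF.
        using (var stream = new MemoryStream(bytes))
        using (var reader = new StreamReader(stream, encoding, true))
        {
            return reader.ReadToEnd();
        }
    }
}
```

The array is read twice, which keeps the sketch simple at the cost of a second pass over the data.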

The W3C XML specification has a section on how to determine the encoding of a byte string.

First check for a Unicode Byte Order Mark

A BOM is just another character; it's the:

'ZERO WIDTH NO-BREAK SPACE' (U+FEFF)

For example:

  • ZWNBSP < ? x m l   v e r s
  • "\uFEFF<?xml vers"
  • U+FEFF U+003C U+003F U+0078 U+006D U+006C U+0020 U+0076 U+0065 U+0072 U+0073

The character U+FEFF, along with every other character in the file, is encoded using the appropriate encoding scheme:

  • 00 00 FE FF : UCS-4, big-endian machine (1234 order)
  • FF FE 00 00 : UCS-4, little-endian machine (4321 order)
  • 00 00 FF FE : UCS-4, unusual octet order (2143)
  • FE FF 00 00 : UCS-4, unusual octet order (3412)
  • FE FF ## ## : UTF-16, big-endian
  • FF FE ## ## : UTF-16, little-endian
  • EF BB BF : UTF-8

where ## ## can be anything, except for both being zero

  • U+FEFF U+003C U+003F U+0078 U+006D U+006C U+0020 U+0076 U+0065 U+0072 U+0073
  • ff fe 3c 00 3f 00 78 00 6d 00 6c 00 20 00 76 00 65 00 72 00 73 00 (UTF-16, little-endian)
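To see that the BOM is encoded like any other character, one can encode the example string with a few of the built-in .NET encodings and look at the bytes. This is a small sketch; BomDemo and Hex are names invented here for illustration.

```csharp
using System;
using System.Text;

public static class BomDemo
{
    // Encodes the text with the given encoding and formats the
    // resulting bytes as hyphen-separated hex pairs.
    public static string Hex(Encoding encoding, string text)
    {
        return BitConverter.ToString(encoding.GetBytes(text));
    }
}
```

For the string "\uFEFF<?xml vers", Hex with Encoding.UTF8 gives EF-BB-BF-3C-3F-78-6D-6C-20-76-65-72-73, and Hex with Encoding.Unicode gives FF-FE-3C-00-3F-00-78-00-6D-00-6C-00-20-00-76-00-65-00-72-00-73-00, which matches the signatures listed above.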

So first check the initial bytes for any of those signatures. If you find one of them, return that code-page identifier:

UInt32 GuessEncoding(byte[] XmlString)
{
   //BytesEqual is a helper that returns true when XmlString starts with the given bytes
   if (BytesEqual(XmlString, new byte[] { 0x00, 0x00, 0xFE, 0xFF })) return 12001; //"utf-32BE" - Unicode UTF-32, big endian byte order
   if (BytesEqual(XmlString, new byte[] { 0xFF, 0xFE, 0x00, 0x00 })) return 12000; //"utf-32" - Unicode UTF-32, little endian byte order
   if (BytesEqual(XmlString, new byte[] { 0x00, 0x00, 0xFF, 0xFE })) throw new Exception("Nobody supports 2143 UCS-4");
   if (BytesEqual(XmlString, new byte[] { 0xFE, 0xFF, 0x00, 0x00 })) throw new Exception("Nobody supports 3412 UCS-4");
   if (BytesEqual(XmlString, new byte[] { 0xFE, 0xFF }))
   {
      //UTF-16 only if the next two bytes are not both zero
      if (XmlString.Length < 4 || XmlString[2] != 0 || XmlString[3] != 0)
         return 1201;  //"unicodeFFFE" - Unicode UTF-16, big endian byte order
   }
   if (BytesEqual(XmlString, new byte[] { 0xFF, 0xFE }))
   {
      //UTF-16 only if the next two bytes are not both zero
      if (XmlString.Length < 4 || XmlString[2] != 0 || XmlString[3] != 0)
         return 1200;  //"utf-16" - Unicode UTF-16, little endian byte order
   }
   if (BytesEqual(XmlString, new byte[] { 0xEF, 0xBB, 0xBF })) return 65001; //"utf-8" - Unicode (UTF-8)

Or else look for <?xml

If the XML document has no Byte Order Mark character, then you move on to looking for the first five characters that every XML document must have:

<?xml

It's helpful to know that

  • < is #x0000003C
  • ? is #x0000003F

With that we have enough to look at the first four bytes:

  • 00 00 00 3C : UCS-4, big-endian machine (1234 order)
  • 3C 00 00 00 : UCS-4, little-endian machine (4321 order)
  • 00 00 3C 00 : UCS-4, unusual octet order (2143)
  • 00 3C 00 00 : UCS-4, unusual octet order (3412)
  • 00 3C 00 3F : UTF-16, big-endian
  • 3C 00 3F 00 : UTF-16, little-endian
  • 3C 3F 78 6D : UTF-8
  • 4C 6F A7 94 : some flavor of EBCDIC

So we can then add more to our code:

   if (BytesEqual(XmlString, new byte[] { 0x00, 0x00, 0x00, 0x3C })) return 12001; //"utf-32BE" - Unicode UTF-32, big endian byte order
   if (BytesEqual(XmlString, new byte[] { 0x3C, 0x00, 0x00, 0x00 })) return 12000; //"utf-32" - Unicode UTF-32, little endian byte order
   if (BytesEqual(XmlString, new byte[] { 0x00, 0x00, 0x3C, 0x00 })) throw new Exception("Nobody supports 2143 UCS-4");
   if (BytesEqual(XmlString, new byte[] { 0x00, 0x3C, 0x00, 0x00 })) throw new Exception("Nobody supports 3412 UCS-4");
   if (BytesEqual(XmlString, new byte[] { 0x00, 0x3C, 0x00, 0x3F })) return 1201;  //"unicodeFFFE" - Unicode UTF-16, big endian byte order
   if (BytesEqual(XmlString, new byte[] { 0x3C, 0x00, 0x3F, 0x00 })) return 1200;  //"utf-16" - Unicode UTF-16, little endian byte order
   if (BytesEqual(XmlString, new byte[] { 0x3C, 0x3F, 0x78, 0x6D })) return 65001; //"utf-8" - Unicode (UTF-8)
   if (BytesEqual(XmlString, new byte[] { 0x4C, 0x6F, 0xA7, 0x94 }))
   {
      //Some variant of EBCDIC, e.g.:
      //20273   IBM273  IBM EBCDIC Germany
      //20277   IBM277  IBM EBCDIC Denmark-Norway
      //20278   IBM278  IBM EBCDIC Finland-Sweden
      //20280   IBM280  IBM EBCDIC Italy
      //20284   IBM284  IBM EBCDIC Latin America-Spain
      //20285   IBM285  IBM EBCDIC United Kingdom
      //20290   IBM290  IBM EBCDIC Japanese Katakana Extended
      //20297   IBM297  IBM EBCDIC France
      //20420   IBM420  IBM EBCDIC Arabic
      //20423   IBM423  IBM EBCDIC Greek
      //20424   IBM424  IBM EBCDIC Hebrew
      //20833   x-EBCDIC-KoreanExtended IBM EBCDIC Korean Extended
      //20838   IBM-Thai    IBM EBCDIC Thai
      //20871   IBM871  IBM EBCDIC Icelandic
      //20880   IBM880  IBM EBCDIC Cyrillic Russian
      //20905   IBM905  IBM EBCDIC Turkish
      //20924   IBM00924    IBM EBCDIC Latin 1/Open System (1047 + Euro symbol)
      throw new Exception("We don't support EBCDIC. Sorry");
   }

   //Otherwise assume UTF-8, and fail to decode it anyway
   return 65001; //"utf-8" - Unicode (UTF-8)

   //Any code is in the public domain. No attribution required.
}
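The code above relies on a BytesEqual helper that the answer does not show. A minimal sketch of what it might look like follows, assuming it is meant as a prefix comparison; the class name SignatureHelper is invented for illustration.

```csharp
public static class SignatureHelper
{
    // Assumed behaviour: true when 'data' begins with every byte of
    // 'signature', in order. Data shorter than the signature never matches.
    public static bool BytesEqual(byte[] data, byte[] signature)
    {
        if (data == null || signature == null || data.Length < signature.Length)
            return false;

        for (int i = 0; i < signature.Length; i++)
        {
            if (data[i] != signature[i])
                return false;
        }
        return true;
    }
}
```

The code-page identifier that GuessEncoding returns can then be passed to Encoding.GetEncoding to obtain an Encoding object for decoding the bytes.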

You could look at the first 40-ish bytes [1]. They should contain the document declaration (assuming it has a document declaration), which should either contain the encoding, or else you can assume it's UTF-8 or UTF-16, which should be obvious from how you've understood the <?xml part. (Just check for both patterns.)

Realistically, do you expect you'll ever get anything other than UTF-8 or UTF-16? If not, you could check for the patterns you get at the start of both of those and throw an exception if it doesn't follow either pattern. Alternatively, if you want to make another attempt, you could always try to decode the document as UTF-8, re-encode it and see if you get the same bytes back. It's not ideal, but it might just work.
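The decode and re-encode check could be sketched roughly like this. Utf8Check and RoundTripsAsUtf8 are illustrative names, and the check is a heuristic: plain ASCII text stored as UTF-16 is also a valid UTF-8 byte sequence, so a true result does not prove the data is UTF-8.

```csharp
using System.Linq;
using System.Text;

public static class Utf8Check
{
    // Decodes the bytes as UTF-8 and encodes the result again.
    // Invalid sequences decode to U+FFFD, which re-encodes to
    // different bytes, so the comparison fails for them.
    public static bool RoundTripsAsUtf8(byte[] bytes)
    {
        string text = Encoding.UTF8.GetString(bytes);
        byte[] reencoded = Encoding.UTF8.GetBytes(text);
        return bytes.SequenceEqual(reencoded);
    }
}
```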

I'm sure there are more rigorous ways of doing this, but they're likely to be finicky :)

[1] Quite possibly less than this. I figure 20 characters should be enough, which is 40 bytes in UTF-16.

The first 2 or 3 bytes may be a Byte Order Mark (BOM), which can tell you whether the stream is UTF-8, Unicode-LittleEndian or Unicode-BigEndian.

The UTF-8 BOM is 0xEF 0xBB 0xBF, Unicode-BigEndian is 0xFE 0xFF, and Unicode-LittleEndian is 0xFF 0xFE.

If none of these are present then you can use ASCII to test for <?xml (note that most modern XML generation sticks to the standard that no white space may precede the XML declaration).

The declaration uses only ASCII characters up until ?>, so you can look for encoding= and read its value. If encoding isn't present, or the <?xml declaration is not present, then you can assume UTF-8.
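The steps described here could be sketched as follows. Everything named in it is an assumption made for illustration: the class XmlDeclarationSniffer, the method GuessEncodingName, the returned labels "utf-16LE" and "utf-16BE", and the 200-byte limit on how far to look for the end of the declaration. It only covers the ASCII-compatible case once the BOM checks have failed.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class XmlDeclarationSniffer
{
    public static string GuessEncodingName(byte[] bytes)
    {
        // Step 1: byte order marks.
        if (StartsWith(bytes, 0xEF, 0xBB, 0xBF)) return "utf-8";
        if (StartsWith(bytes, 0xFE, 0xFF)) return "utf-16BE";
        if (StartsWith(bytes, 0xFF, 0xFE)) return "utf-16LE";

        // Step 2: read the start of the document as ASCII and look
        // for an XML declaration at the very first byte.
        int length = Math.Min(bytes.Length, 200); // arbitrary limit
        string head = Encoding.ASCII.GetString(bytes, 0, length);
        if (!head.StartsWith("<?xml", StringComparison.Ordinal)) return "utf-8";

        int end = head.IndexOf("?>", StringComparison.Ordinal);
        if (end < 0) return "utf-8";

        // Step 3: pull the value of encoding= out of the declaration.
        string declaration = head.Substring(0, end);
        Match match = Regex.Match(
            declaration,
            @"encoding\s*=\s*[""']([A-Za-z][A-Za-z0-9._\-]*)[""']");
        return match.Success ? match.Groups[1].Value : "utf-8";
    }

    // True when 'data' begins with the given bytes.
    private static bool StartsWith(byte[] data, params byte[] prefix)
    {
        if (data.Length < prefix.Length) return false;
        for (int i = 0; i < prefix.Length; i++)
        {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }
}
```

The sketch returns a name rather than an Encoding object because not every declared name is necessarily available through Encoding.GetEncoding on a given runtime.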
