Two equal XML files with differences when reading the file in Ruby on Rails

I have to parse XML files coming from two different pieces of software. One of the files fails during parsing. While debugging, I copied the content of the "good file" and pasted it into the "bad file", but the error persisted! I also pasted the "bad file" content into the good one, and everything worked!

I think this has to do with an encoding problem.

If an XML file has no encoding declared, is there some metadata that I could be missing?

The output when I read each file in Ruby:

File.read(Rails.root.join('bad-file.xml'))

\xFF\xFE<\u0000f\u0000i\u0000l\u0000e\u0000>\u0000\r\u0000<\u0000A\u0000L\u0000L\u0000_\u0000I\u0000N\u0000S\u0000T\u0000A\u0000N\u0000C\u0000E\u0000S\u0000>\u0000\r\u0000\r\u0000<\u0000i\u0000n\u0000s\u0000t\u0000a\u0000n\u0000c\u0000e\u0000>\u0000\r\u0000<\u0000I\u0000D\u0000>\u00009\u00005\u00003\u0000<\u0000/\u0000I\u0000D\u0000>\u0000\r\u0000<\u0000s\u0000t\u0000a\u0000r\u0000t\u0000>\u00005\u00000\u00005\u00009\u0000.\u00002\u00006\u00002\u00002\u00000\u00001....

File.read(Rails.root.join('good-file.xml'))

<file>\r\n<ALL_INSTANCES>\r\n\r\n<instance>\r\n<ID>953</ID>\r\n<start>5059.2622016567</start>\r\n<end>5060.2622016567</end>\r\n<code>timer-1sec</code>\r\n<label>\r\n<group>result</group>\r\n<text>Dabang Eindringen SK</text>\r\n</label>\r\n</instance>\r\n</ALL_INSTANCES>\r\n\r\n<ROWS>\r\n<row>\r\n<code>timer-1sec</code>\r\n<R>0</R>\r\n<G>0</G>\r\n<B>0</B>\r\n</row>\r\n</ROWS>\r\n</file>

Those first two bytes, \xFF\xFE, are a Unicode byte order mark (BOM): they signify that the rest of the data is UTF-16 in little-endian order.

If you do
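To check which kind of file you have before parsing, you can look at the leading bytes yourself. This is only a minimal sketch: sniff_bom and the BOMS table are hypothetical names made up for illustration, and it only covers the UTF-8 and UTF-16 marks (a UTF-32LE mark also begins with \xFF\xFE and is not distinguished here).

```ruby
# Hypothetical lookup table: BOM bytes => the encoding they indicate.
# Only UTF-8 and UTF-16 are covered; UTF-32 is deliberately left out.
BOMS = {
  "\xEF\xBB\xBF".b => Encoding::UTF_8,
  "\xFF\xFE".b     => Encoding::UTF_16LE,
  "\xFE\xFF".b     => Encoding::UTF_16BE
}.freeze

# Hypothetical helper: returns the encoding implied by a leading BOM,
# or nil when the data does not start with one of the marks above.
def sniff_bom(bytes)
  head = bytes.b # compare as raw bytes, whatever encoding the string is tagged with
  BOMS.each { |bom, encoding| return encoding if head.start_with?(bom) }
  nil
end

bad_sample  = "\xFF\xFE<\x00f\x00i\x00l\x00e\x00>\x00".b # starts like bad-file.xml
good_sample = "<file>\r\n<ALL_INSTANCES>"                # starts like good-file.xml

bad_encoding  = sniff_bom(bad_sample)
good_encoding = sniff_bom(good_sample)
```

With real files you would pass File.binread(path) instead of the sample strings, so that Ruby does not try to interpret the bytes before you inspect them.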

File.read(path, mode: 'r:UTF-16LE')

Then the external encoding for the file will be set to UTF-16LE. If a default internal encoding is set (Rails sets it to UTF-8), the data is transcoded to that encoding before being returned. You can explicitly request UTF-8 by doing

File.read(path, mode: 'r:UTF-16LE:UTF-8')
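As a rough end-to-end check, the sketch below writes a small UTF-16LE file with a BOM into a temporary directory and reads it back with the mode string above. The file name and the shortened XML are made up for illustration. Note that the BOM survives transcoding as the character U+FEFF at the start of the string, so the sketch strips it before the text would be handed to an XML parser.

```ruby
require "tmpdir"

# Shortened, made-up stand-in for the real XML content.
xml = "<file><ID>953</ID></file>"

text = nil
Dir.mktmpdir do |dir|
  path = File.join(dir, "bad-file.xml")

  # Write the BOM followed by the XML encoded as UTF-16LE, as raw bytes.
  File.binwrite(path, "\xFF\xFE".b + xml.encode(Encoding::UTF_16LE).b)

  # External encoding UTF-16LE, internal encoding UTF-8.
  text = File.read(path, mode: "r:UTF-16LE:UTF-8")
end

# The BOM is transcoded too and shows up as U+FEFF; drop it.
text = text.delete_prefix("\uFEFF")
```

After this, text is an ordinary UTF-8 string with the same content as the "good file" variant, which is the form the parser was already handling.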
