
Correcting the XML encoding

I have an XML file whose encoding declaration is set to 'utf-8', but the content is actually ISO-8859-1.

Programmatically, how do I detect this in Perl and Python, and how do I decode the file with a different encoding?

In Perl, I tried

$xml = decode('iso-8859-1',$file)

but this does not work.

Miscoding is notoriously tricky to detect, because arbitrary binary data is often a valid string in many different encodings.

In Perl, the easiest thing to try is to decode the data as UTF-8 and check for failures. (This only works one way round: a UTF-8 encoded western-language document is almost always a valid ISO-8859-1 document as well, so the reverse test tells you nothing.)

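use Encode qw(decode FB_CROAK LEAVE_SRC);
# $file holds the raw bytes; LEAVE_SRC stops decode() from modifying it in place
my $xml = eval { decode( 'UTF-8', $file, FB_CROAK | LEAVE_SRC ) };
if ( $@ ) { $xml = decode( 'ISO-8859-1', $file ) }   # not valid UTF-8, so probably ISO-8859-1 instead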
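Since the question also asks about Python, the same check can be sketched there. This is only a minimal illustration: the function name `decode_xml_bytes` and the `fallback` parameter are made up for this example, and it assumes the only realistic alternative to UTF-8 is ISO-8859-1.

```python
def decode_xml_bytes(raw, fallback="iso-8859-1"):
    """Return (text, encoding_used) for the raw bytes of a file.

    Hypothetical helper: tries strict UTF-8 first, then the fallback.
    """
    try:
        # bytes.decode() is strict by default: invalid UTF-8 raises
        # UnicodeDecodeError instead of being silently replaced.
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte value to a code point, so this
        # decode cannot fail (which is also why it proves nothing).
        return raw.decode(fallback), fallback


if __name__ == "__main__":
    text, used = decode_xml_bytes(b"<name>caf\xe9</name>")
    print(used, text)
```

Read the file in binary mode (`open(path, "rb")`) so that Python does not attempt a decode of its own before this check runs.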

Now that you've detected the problem, you have to work around it. How you do that will most likely depend on the parser library you're using, but some general principles ought to apply.

If there's no XML declaration or MIME type, the Perl native encoding will be used, so the code you posted should do the trick.

If there's a mistaken XML declaration, you can either override it using whatever facility your XML parsing library provides, or simply replace it manually before handing the data over.

# assuming it's on line 1:
$contents =~ s/.*/<?xml version="1.0" encoding="ISO-8859-1"?>/;
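For the "override it in the parser" option, here is a rough Python sketch using the standard library. `xml.etree.ElementTree.XMLParser` accepts an `encoding` argument that is documented to override the encoding named in the file; the wrapper function `parse_with_forced_encoding` is a name invented for this example.

```python
import xml.etree.ElementTree as ET


def parse_with_forced_encoding(raw, encoding="iso-8859-1"):
    """Parse raw XML bytes, ignoring the encoding in the XML declaration.

    Hypothetical helper: `encoding` is passed to the underlying parser
    and takes precedence over whatever the declaration claims.
    """
    parser = ET.XMLParser(encoding=encoding)
    parser.feed(raw)
    return parser.close()  # returns the root Element


if __name__ == "__main__":
    # Declaration says utf-8, but 0xE9 is a lone ISO-8859-1 byte.
    raw = b'<?xml version="1.0" encoding="utf-8"?><name>caf\xe9</name>'
    root = parse_with_forced_encoding(raw)
    print(root.tag, root.text)
```

Other libraries expose the same idea under different names, so check what your parser offers before falling back to editing the declaration by hand.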

The general procedure should be the same no matter which language you use:

Open your file and read the raw bytes into a string.

Attempt to decode the raw bytes as UTF-8, using an option that reports errors or raises an exception if the data is not valid UTF-8.

The chance that a file of meaningful text of reasonable length, encoded as ISO-8859-1, will pass this UTF-8 test is very low (unless of course it's pure ASCII, which is a subset of both ISO-8859-1 and UTF-8).

If the test fails, strip off the XML declaration if it exists. Prepend this:

<?xml version="1.0" encoding="ISO-8859-1"?>
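The steps above can be sketched in Python roughly as follows. This works on bytes rather than decoded text, and assumes the declaration (if any) sits at the very start of the data with no byte order mark in front of it. The names `fix_declaration`, `XML_DECL` and `NEW_DECL` are invented for this example.

```python
import re

# An XML declaration at the very start of the data. Matching on bytes is
# fine here because the declaration itself is plain ASCII.
XML_DECL = re.compile(rb"\A<\?xml\s[^>]*\?>")
NEW_DECL = b'<?xml version="1.0" encoding="ISO-8859-1"?>'


def fix_declaration(raw):
    """Return raw unchanged if it is valid UTF-8, otherwise relabel it
    as ISO-8859-1 (hypothetical helper)."""
    try:
        raw.decode("utf-8")  # strict by default
        return raw           # valid UTF-8: the declaration can stay
    except UnicodeDecodeError:
        body = XML_DECL.sub(b"", raw, count=1)  # strip old declaration, if any
        return NEW_DECL + body


if __name__ == "__main__":
    bad = b'<?xml version="1.0" encoding="utf-8"?>\n<a>\xe9</a>'
    print(fix_declaration(bad))
```

The result is still a byte string, so it can be handed straight to an XML parser, which will then honour the corrected declaration.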

By the way, are you sure you actually have ISO-8859-1 data and not CP1252 data (from a Windows platform)?
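One rough way to tell the two apart, sketched in Python: bytes 0x80 to 0x9F are invisible C1 control codes in ISO-8859-1 but mostly printable characters (curly quotes, dashes, the euro sign) in CP1252, so their presence in ordinary text hints at CP1252. This is a heuristic only, and `guess_single_byte_encoding` is a name invented for this example.

```python
def guess_single_byte_encoding(raw):
    """Guess between CP1252 and ISO-8859-1 for data that failed the UTF-8 test.

    Heuristic: real text almost never contains C1 control codes, so any
    byte in 0x80-0x9F suggests the data came from a Windows code page.
    """
    if any(0x80 <= b <= 0x9F for b in raw):
        return "cp1252"
    return "iso-8859-1"


if __name__ == "__main__":
    print(guess_single_byte_encoding(b"caf\xe9"))          # only a high Latin-1 byte
    print(guess_single_byte_encoding(b"\x93quoted\x94"))   # Windows curly quotes
```

Note that Python's `cp1252` codec leaves a handful of byte values in that range undefined, so a strict decode with it can still raise `UnicodeDecodeError`.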

It goes without saying, of course, that finding and correcting the root cause of data corruption is always better than trying to detect and repair the corruption after the event.

Apart from that, the main point to make is that your file isn't XML, so you can't fix it using XML tools. You need to attack it at the character or binary level. As others have said, step 1 is to detect that it's not valid UTF-8; step 2 is to strip off the incorrect XML declaration and replace it with a correct one. Neither of those should be particularly difficult.


 