Correcting the XML encoding

I have an XML file whose declaration says the encoding is 'utf-8', but the content is actually ISO-8859-1.

Programmatically, how do I detect this in Perl and Python, and how do I decode the file with a different encoding?

In Perl, I tried

$xml = decode('iso-8859-1', $file);

but this does not work.

Mis-encoding is notoriously tricky to detect, because arbitrary binary data is often a valid string in many encodings.

In Perl, the easiest thing to try is to decode the data as UTF-8 and check for failure. (This only works in one direction: a UTF-8-encoded Western-language document is almost always a valid ISO-8859-1 document as well.)

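use Encode;   # exports decode() and decode_utf8()
# FB_CROAK dies on invalid UTF-8; LEAVE_SRC stops decode_utf8 from modifying $file in place.
my $xml = eval { decode_utf8( $file, Encode::FB_CROAK | Encode::LEAVE_SRC ) };
$xml = decode( 'iso-8859-1', $file ) if $@;   # not valid UTF-8, so probably ISO-8859-1 instead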
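
Since the question also asks about Python, here is a rough equivalent of the check above. It is only a sketch: the function name decode_xml_bytes and the sample strings are made up for illustration, and it assumes the only two candidates are UTF-8 and ISO-8859-1.

```python
from typing import Tuple


def decode_xml_bytes(raw: bytes) -> Tuple[str, str]:
    """Return (text, encoding_used), trying strict UTF-8 first."""
    try:
        # Strict error handling is the default: invalid UTF-8 raises UnicodeDecodeError.
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte to a code point, so this decode cannot fail.
        return raw.decode("iso-8859-1"), "iso-8859-1"


# Illustrative sample data (not from the original question).
latin1_bytes = "café".encode("iso-8859-1")   # b'caf\xe9'
utf8_bytes = "café".encode("utf-8")          # b'caf\xc3\xa9'

print(decode_xml_bytes(latin1_bytes))
print(decode_xml_bytes(utf8_bytes))
```

The fallback branch is also why the test only works one way round: decoding the UTF-8 bytes as ISO-8859-1 never raises, it just silently produces mojibake.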

Now that you've detected the problem, you have to work around it. How you do that will most likely depend on the parser library you're using, but some general principles ought to apply.

If there's no XML declaration or MIME-type, the Perl native encoding will be used, so the code you copied should do the trick.

If there's a mistaken XML declaration, you could either override it using any facility your XML decoding library provides, or just replace it manually before handing it over.

# assuming the declaration is the very first thing in the document:
$contents =~ s/\A<\?xml.*?\?>/<?xml version="1.0" encoding="ISO-8859-1"?>/s;
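
As an example of the first option (overriding the declaration through the parser), Python's standard xml.etree.ElementTree lets you pass an encoding to XMLParser, which is documented to override the encoding given in the file. A minimal sketch, where the sample document is made up for illustration:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: declared as utf-8, but byte 0xE9 is "é" in ISO-8859-1.
raw = b'<?xml version="1.0" encoding="utf-8"?><name>caf\xe9</name>'

# The encoding argument overrides the encoding declared in the document.
parser = ET.XMLParser(encoding="iso-8859-1")
root = ET.fromstring(raw, parser=parser)
print(root.text)
```

Other libraries may or may not offer an equivalent switch, so check your parser's documentation before relying on this.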

The general procedure should be the same no matter what language:

Open your file, read the raw bytes into a string.

Attempt to decode the raw_bytes as UTF-8, with an option that checks for errors or raises an exception if it is not valid UTF-8.

The chance that a file of meaningful Unicode text of reasonable length successfully encoded as ISO-8859-1 will pass this UTF-8 test is very low (unless of course it's ASCII which is a subset of both ISO-8859-1 and UTF-8).

If the test fails, strip off the XML declaration if it exists. Prepend this:

<?xml version="1.0" encoding="ISO-8859-1"?>
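
The procedure above could look something like this in Python, working purely on bytes. Treat it as a sketch under a few assumptions: the helper name fix_declared_encoding is invented here, the declaration (if any) is assumed to sit at the very start of the data, and the data is assumed to be in an ASCII-compatible encoding.

```python
import re

# Matches an XML declaration at the very start of the data, plus trailing whitespace.
XML_DECL = re.compile(rb"\A<\?xml[^>]*\?>\s*")
LATIN1_DECL = b'<?xml version="1.0" encoding="ISO-8859-1"?>\n'


def fix_declared_encoding(raw: bytes) -> bytes:
    """Return raw unchanged if it is valid UTF-8; otherwise relabel it as ISO-8859-1."""
    try:
        raw.decode("utf-8")  # strict UTF-8 test; raises on invalid input
        return raw
    except UnicodeDecodeError:
        # Strip the existing (wrong) declaration if there is one, then prepend a correct one.
        body = XML_DECL.sub(b"", raw, count=1)
        return LATIN1_DECL + body
```

To use it, read the file in binary mode (open(path, "rb")), pass the bytes through the function, and hand the result to your XML parser.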

By the way, are you sure you actually have ISO-8859-1 data and not CP1252 data (from a Windows platform)?
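
One rough way to tell the two apart: bytes 0x80 to 0x9F are C1 control characters in ISO-8859-1, which almost never appear in real text, whereas CP1252 uses most of them for printable characters such as curly quotes, dashes and the euro sign. The sketch below (the function name is invented, and this is a heuristic, not proof) applies that test to data you already know is not UTF-8.

```python
def guess_single_byte_encoding(raw: bytes) -> str:
    """Guess between ISO-8859-1 and CP1252 for data already known not to be UTF-8."""
    # 0x80-0x9F: control characters in ISO-8859-1, mostly printable in CP1252.
    if any(0x80 <= b <= 0x9F for b in raw):
        return "cp1252"
    return "iso-8859-1"


# Illustrative samples: 0x93/0x94 are curly double quotes in CP1252.
print(guess_single_byte_encoding(b"caf\xe9"))
print(guess_single_byte_encoding(b"\x93quoted\x94"))
```

If the guess is CP1252, the name to put in the XML declaration would be windows-1252. Note that a handful of bytes in that range are undefined in CP1252, so a strict decode can still fail on them.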

It goes without saying, of course, that finding and correcting the root cause of a data corruption is always better than trying to detect and repair the corruption after the event.

Apart from that, the main point to make is that your file isn't XML so you can't fix it using XML tools. You need to attack it at the character or binary level. As others have said, step 1 is to detect that it's not valid UTF-8; step 2 is to strip off the incorrect XML declaration and replace it with a correct one. Neither of those should be particularly difficult.
