
Handling Encoding Errors While Reading XML With PHP

I'm using XMLReader to parse XML from a 3rd party. The files are supposed to be UTF-8, but I'm getting this error:

parser error : Input is not proper UTF-8, indicate encoding !

Bytes: 0x11 0x72 0x20 0x41 in C:\\file.php on line 166

Looking at the XML file in Notepad++, it's clear what's causing this: the problematic line contains the control character DC1 (byte 0x11).

The XML file is provided by a 3rd party that I cannot rely on to fix this or to ensure it doesn't happen in the future. Could someone recommend a good way of dealing with this? I'd like to just do away with the control character (in this particular case, deleting it from the XML file is fine), but I'm concerned that always doing this could lead to unforeseen problems down the road. Thanks.

Why can't the 3rd party reliably fix this issue? If they have illegal characters in their XML, I would wager that it's a valid issue.

Having said that, why not just remove the character with str_replace() before you parse the file?
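A minimal sketch of that suggestion, assuming the whole file fits in memory as a string. The helper name strip_dc1 and the sample input are hypothetical, not part of the original question.

```php
<?php
// Hypothetical helper: remove every DC1 byte (0x11) from a raw XML string.
function strip_dc1(string $xml): string
{
    // "\x11" is DC1. str_replace() compares raw bytes, and in well-formed
    // UTF-8 a byte below 0x80 never occurs inside a multi-byte sequence,
    // so this cannot damage other characters.
    return str_replace("\x11", '', $xml);
}

// Made-up sample input containing one stray DC1.
$raw = "<item>Docto\x11r Who</item>";
echo strip_dc1($raw), PHP_EOL;
```

The cleaned string could then be handed to the parser, for example through XMLReader::XML() instead of opening the file directly.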

You can use str_replace() provided that the string is valid UTF-8. Note that str_replace() is byte-oriented: PHP strings are plain byte sequences and the function knows nothing about encodings, so a replacement is only safe when both the input and the search string are well-formed UTF-8.

And there is the rub: if your 3rd party includes random whitespace and control characters that serve no purpose in XML, you might as well assume they will eventually break UTF-8 too. So you can't use str_replace() with confidence (only in good faith) until you have ascertained that their current dump of the day is not entirely useless.
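One way to do that check before stripping anything is sketched below. It assumes the input is held in a string; the helper name sanitize_xml is hypothetical. It only removes the C0 control characters that XML 1.0 forbids (everything below 0x20 except tab, line feed and carriage return) and refuses input that is not valid UTF-8.

```php
<?php
// Hypothetical helper: validate the encoding first, then strip the
// control characters that XML 1.0 does not allow.
function sanitize_xml(string $xml): string
{
    // With the /u modifier PCRE rejects subjects that are not valid
    // UTF-8, so an empty pattern doubles as a validity check:
    // preg_match() returns 1 for valid input and false otherwise.
    if (preg_match('//u', $xml) !== 1) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }

    // Remove 0x00-0x08, 0x0B, 0x0C and 0x0E-0x1F; keep tab, LF and CR.
    return preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', '', $xml);
}
```

Throwing on invalid UTF-8 rather than guessing at a repair keeps the decision about a broken delivery with the caller.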

Maybe you could take a shortcut and stuff it in a libxml DOMDocument object and suppress errors with @, leaving the libxml library to deal with errors. Something like:

$doc = new DOMDocument();
// @ silences libxml's warnings; loadXML() returns false if parsing fails.
if (@$doc->loadXML($raw_string)) {
  // Document is loaded. Time to normalize() it.
}
else {
  throw new Exception("This data is junk");
}
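As an alternative to the @ operator, libxml can buffer its errors so they can be logged or inspected. The following is a sketch under that assumption; the helper name load_xml_collecting_errors is hypothetical, and it expects a non-empty string.

```php
<?php
// Hypothetical helper: returns [DOMDocument|null, list of parser messages].
function load_xml_collecting_errors(string $raw): array
{
    // Buffer libxml errors instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $ok = $doc->loadXML($raw);

    $messages = [];
    foreach (libxml_get_errors() as $error) {
        $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
    }

    libxml_clear_errors();
    libxml_use_internal_errors($previous); // restore the caller's setting

    return [$ok ? $doc : null, $messages];
}
```

A caller can then reject the delivery and still record exactly what the parser complained about, which is useful evidence when reporting the problem back to the 3rd party.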

Why are you and the third party exchanging data in XML? Presumably both parties expect to get some benefits by using XML rather than some random proprietary format. If you allow them to get away with generating bad XML (I prefer to call it non-XML), then neither party is getting these benefits. It's in their interests to mend their ways. Try to convince them of this.

