
Special Character in XML using PHP

I am trying to generate an XML file in which some of the values contain special characters such as μmol/l, x10³ cells/µl and many more. I also need to be able to include superscripts.

I encoded the text μmol/l to something like this using an ordutf8 function from php.net:

&#956&#109&#111&#108&#47&#108

function ords_to_unistr($ords, $encoding = 'UTF-8'){
    // Turns an array of ordinal values into a string of unicode characters
    $str = '';
    for($i = 0; $i < sizeof($ords); $i++){
        // Pack this number into a 4-byte string
        // (Or multiple one-byte strings, depending on context.)               
        $v = $ords[$i];
        $str .= pack("N",$v);
    }
    $str = mb_convert_encoding($str,$encoding,"UCS-4BE");
    return($str);           
}

function unistr_to_ords($str, $encoding = 'UTF-8'){       
    // Turns a string of unicode characters into an array of ordinal values,
    // Even if some of those characters are multibyte.
    $str = mb_convert_encoding($str,"UCS-4BE",$encoding);
    $ords = array();

    // Visit each unicode character
    for($i = 0; $i < mb_strlen($str,"UCS-4BE"); $i++){       
        // Now we have 4 bytes. Find their total
        // numeric value.
        $s2 = mb_substr($str,$i,1,"UCS-4BE");                   
        $val = unpack("N",$s2);           
        $ords[] = $val[1];               
    }       
    return($ords);
}

I have successfully converted this code back to "richtext" using PHPExcel to generate Excel documents and PDFs, but I now need to put it into an XML file.

If I use the &# characters as is, I get an error message saying:

SimpleXMLElement::addChild(): invalid decimal character value

Here are more values I have in the database that need to be made "XML" friendly:

&#120&#49&#48&#60&#115&#117&#112&#62&#54&#60&#47&#115&#117&#112&#62&#32&#99&#101&#108&#108&#115&#47&#181&#108

Converted from x10<sup>6</sup> cells/µl (the superscript is stored as <sup> markup).

There is no need to encode these characters. XML strings can use UTF-8 or another encoding, and the serializer will encode characters as necessary for the encoding of the document.

$foo = new SimpleXmlElement('<?xml version="1.0" encoding="UTF-8"?><foo/>');
$foo->addChild('bar', 'μmol/l, x10³ cells/µl'); 
echo $foo->asXml();

Output (special characters not encoded):

<?xml version="1.0" encoding="UTF-8"?>
<foo><bar>μmol/l, x10³ cells/µl</bar></foo>

To force entities for the special characters, you need to change the encoding:

$foo = new SimpleXmlElement('<?xml version="1.0" encoding="ASCII"?><foo/>');
$foo->addChild('bar', 'μmol/l, x10³ cells/µl');
echo $foo->asXml();

Output (special characters encoded):

<?xml version="1.0" encoding="ASCII"?>
<foo><bar>&#956;mol/l, x10&#179; cells/&#181;l</bar></foo>

I suggest you convert your custom encoding back to UTF-8. That way the XML API can take care of it. If you would rather store the strings in the custom encoding, you need to work around a bug.

A string like &#120&#49&#48&#60&#115&#117 triggers a bug in SimpleXML/DOM. The second argument of SimpleXMLElement::addChild() and of DOMDocument::createElement() is not escaped properly. You need to create the content as a text node and append it.
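As a minimal sketch of that conversion, assuming the stored values always look like the ones in the question (decimal code points prefixed with &# and no trailing semicolon), the hypothetical helper decode_ords() below turns them back into UTF-8. The helper name is made up for illustration; the conversion itself reuses the pack()/mb_convert_encoding() approach from ords_to_unistr() and needs the mbstring extension.

```php
<?php
// Hypothetical helper: decode "&#NNN" sequences (trailing semicolon optional)
// back into a UTF-8 string.
function decode_ords(string $encoded): string {
    return preg_replace_callback(
        '/&#(\d+);?/',
        function (array $match): string {
            // Pack the code point as 4 big-endian bytes (UCS-4BE), then convert to UTF-8
            return mb_convert_encoding(pack('N', (int) $match[1]), 'UTF-8', 'UCS-4BE');
        },
        $encoded
    );
}

$unit = decode_ords('&#956&#109&#111&#108&#47&#108');
echo $unit, "\n"; // μmol/l

// The decoded UTF-8 string can be handed to SimpleXML directly.
$foo = new SimpleXMLElement('<?xml version="1.0" encoding="UTF-8"?><foo/>');
$foo->addChild('bar', $unit);
echo $foo->asXML();
```

This only covers values without markup. A value that decodes to something containing & would still need the text node approach.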

Here is a small class that extends SimpleXMLElement and adds a workaround:

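class MySimpleXMLElement extends SimpleXMLElement {

  // Keep the same parameters as SimpleXMLElement::addChild(), otherwise
  // PHP 8 rejects the override as incompatible.
  #[\ReturnTypeWillChange]
  public function addChild($nodeName, $content = NULL, $namespace = NULL) {
    // Create the element without content, then append the content as a text node
    $child = parent::addChild($nodeName, NULL, $namespace);
    if (isset($content)) {
      $node = dom_import_simplexml($child);
      $node->appendChild($node->ownerDocument->createTextNode($content));
    }
    return $child;
  }
}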

$foo = new MySimpleXmlElement('<?xml version="1.0" encoding="UTF-8"?><foo/>');
$foo->addChild('bar', '&#120&#49&#48&#60&#115&#117'); 
echo $foo->asXml();

Output:

<?xml version="1.0" encoding="UTF-8"?>
<foo><bar>&amp;#120&amp;#49&amp;#48&amp;#60&amp;#115&amp;#117</bar></foo>

The & characters from your custom encoding get escaped as the entity &amp; because & is a special character in XML. The XML parser will decode them when the document is read:

$xml = <<<'XML'
<?xml version="1.0" encoding="UTF-8"?>
<foo><bar>&amp;#120&amp;#49&amp;#48&amp;#60&amp;#115&amp;#117</bar></foo>
XML;

$foo = new SimpleXMLElement($xml);
var_dump((string)$foo->bar);

Output:

string(27) "&#120&#49&#48&#60&#115&#117"
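
The same text node approach should also work when building the document with DOM directly instead of SimpleXML. The following is a small sketch under that assumption, reusing the element names foo and bar from the examples above. The content is passed to createTextNode() and not as the second argument of createElement().

```php
<?php
$document = new DOMDocument('1.0', 'UTF-8');
$foo = $document->appendChild($document->createElement('foo'));
$bar = $foo->appendChild($document->createElement('bar'));

// createTextNode() escapes the & characters when the document is serialized
$bar->appendChild($document->createTextNode('&#120&#49&#48&#60&#115&#117'));

echo $document->saveXML();

// Reading the value back returns the original string
var_dump($bar->textContent);
```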
