
PHP: Writing non-english characters to XML - encoding problem

I wrote a small PHP script to edit the site news XML file. I used DOM to manipulate the XML (loading, writing, editing).

It works fine when writing English characters, but once non-English characters have been written, PHP throws an error when trying to load the file again.

If I type non-English characters into the file manually, it loads perfectly fine. If PHP writes the non-English characters, the encoding goes wrong, even though I specified UTF-8 encoding.

Any help is appreciated.

Update: with the helpful answers, it is solved (read below).

Errors:

Warning: DOMDocument::load() [domdocument.load]: Entity 'times' not defined in filepath

Warning: DOMDocument::load() [domdocument.load]: Input is not proper UTF-8, indicate encoding ! Bytes: 0x91 0x26 0x74 0x69 in filepath

Here are the functions responsible for loading and saving the file (self-explanatory):

function get_tags_from_xml(){
// Load news entries from XML file for display
    if(!$xml_file = load_news_file()){
        // A string return value indicates an error
        return "file not found";
    }
    // Otherwise return the list of <text> elements
    return $xml_file->getElementsByTagName("text");
}
function set_news_lang(){
// Sets the news language
    global $news_lang;

    // !empty() avoids an "undefined index" notice when the key is missing
    if(!empty($_POST["news-lang"])){
        $news_lang = htmlentities($_POST["news-lang"]);
    }
    elseif(!empty($_GET["news-lang"])){
        $news_lang = htmlentities($_GET["news-lang"]);
    }
    else{
        $news_lang = "he";
    }
}
function load_news_file(){
// Load XML news file for processing, depending on language
    global $news_lang;

    $doc = new DOMDocument('1.0','utf-8');
    // Create new XML document
    $doc->formatOutput = true;
    // Nicely format the file when it is saved
    if(!$doc->load("news_{$news_lang}.xml")){
    // Load news file by language; load() returns false on failure
        return false;
    }

    return $doc;
}
function save_news_file($doc){
// Save XML news file, depending on language
    global $news_lang;

    // save() serializes the whole document to disk; the discarded saveXML() call was not needed
    $doc->save("news_{$news_lang}.xml");
}

Here is the code for writing to XML (add news):

<?php ob_start()?>
<?php include("include/xml_functions.php")?>
<?php include("../include/functions.php")?>
<?php get_lang();?>
<?php
//TODO: ADD USER AUTHENTICATION!
if(isset($_POST["news"]) && isset($_POST["news-lang"])){
    set_news_lang();

    $news = htmlentities($_POST["news"]);

    $xml_doc = load_news_file();
    $news_list = $xml_doc->getElementsByTagName("text");
    // Get all existing news from file

    $doc_root_element = $xml_doc->getElementsByTagName("news")->item(0);
    // Get the root element of the new XML document
    $new_news_entry = $xml_doc->createElement("text",$news);
    // Create the submitted news entry

    $doc_root_element->appendChild($new_news_entry);
    // Append submitted news entry
    $xml_doc->appendChild($doc_root_element);

    save_news_file($xml_doc);

    header("Location: /cpanel/index.php?lang={$lang}&news-lang={$news_lang}");
}
else{
    header("Location: /cpanel/index.php?lang={$lang}&news-lang={$news_lang}");
}
?>
<?php ob_end_flush()?>

Update: with the helpful answers you provided, it's solved. The value submitted by the form is non-English and contains some HTML entities. I used htmlentities() on the POST data, which made the non-English string unreadable. After replacing htmlentities() with htmlspecialchars(), it works like magic.

Conclusion: htmlentities() can ruin non-English strings when the charset it assumes does not match the actual encoding of the input.
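One plausible explanation for both warnings, assuming a PHP version older than 5.4 (where htmlentities() defaults to ISO-8859-1) and Hebrew input submitted as UTF-8: every Hebrew letter starts with the byte 0xD7, and in ISO-8859-1 that byte is ×, so it is turned into &times; while the second byte of the letter is left dangling. The sketch below passes the charset explicitly to reproduce this on any PHP version; the sample string is made up for illustration.

```php
<?php
// Two Hebrew letters (bet, bet) as raw UTF-8 bytes; sample input for illustration only.
$news = "\xD7\x91\xD7\x91";

// Treating the UTF-8 bytes as ISO-8859-1 turns every 0xD7 into "&times;"
// and leaves the second byte of each letter (0x91) on its own.
$broken = htmlentities($news, ENT_QUOTES, 'ISO-8859-1');

// htmlspecialchars() only touches &, <, >, " and ', so the bytes survive intact.
$intact = htmlspecialchars($news, ENT_QUOTES, 'UTF-8');

echo bin2hex($broken), PHP_EOL; // hex dump of the mangled string
echo bin2hex($intact), PHP_EOL; // hex dump of the untouched string
```

The mangled string contains the byte run 0x91 0x26 0x74 0x69 reported in the second warning, and &times; is not an entity that XML knows, which matches the first.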

Character encoding is always a hassle. Make sure the page containing your form, the XML you load into $dom, and the PHP file itself are all UTF-8 encoded, or convert accordingly. If any one of them is not, the strings you receive won't be UTF-8, and handling them as UTF-8 won't work.

Try this: echo your original news XML onto an empty page. Then switch page encoding in the browser to see which one displays the characters correctly. Repeat this for $news after retrieving the input from the form. This usually provides a clue on where the encoding goes wrong.
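If switching encodings in the browser is inconvenient, the same check can be done in code. This is only a sketch: describe_encoding() is a hypothetical helper name, and the two sample strings are made up. It relies on preg_match() with the /u modifier failing on input that is not valid UTF-8.

```php
<?php
// Hypothetical helper: report whether a string is valid UTF-8 and show its raw bytes.
function describe_encoding(string $label, string $value): string
{
    // With the /u modifier, preg_match() returns false if $value is not valid UTF-8.
    $valid = preg_match('//u', $value) === 1;

    return sprintf(
        '%s: %s, bytes: %s',
        $label,
        $valid ? 'valid UTF-8' : 'NOT valid UTF-8',
        bin2hex($value)
    );
}

$good = "\xD7\x91";      // Hebrew letter bet encoded as UTF-8
$bad  = "\x91&times;";   // a stray byte followed by an HTML entity

echo describe_encoding('good', $good), PHP_EOL;
echo describe_encoding('bad', $bad), PHP_EOL;
```

Calling it on the file contents, on $_POST["news"], and on $news after escaping should show at which step the bytes stop being valid UTF-8.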

It's hard to diagnose the exact issue without pulling the app apart a bit more, but this is a good clue:

Warning: DOMDocument::load() [domdocument.load]: Entity 'times' not defined in filepath

XML doesn't generally like HTML entities like &times; . The only entities predefined in XML are &lt; , &gt; , &amp; , &quot; and &apos; ; any other named entity has to be declared in a DTD.

Use numeric entities instead. So for ×, use &#xD7; and so on.

Here's a quick and dirty trick you can add after your call to htmlentities() :

foreach(array('quot'=>34,'amp'=>38,'lt'=>60,'gt'=>62,'OElig'=>338,'oelig'=>339,
'Scaron'=>352,'scaron'=>353,'Yuml'=>376,'circ'=>710,'tilde'=>732,'ensp'=>8194,
'emsp'=>8195,'thinsp'=>8201,'zwnj'=>8204,'zwj'=>8205,'lrm'=>8206,'rlm'=>8207,
'ndash'=>8211,'mdash'=>8212,'lsquo'=>8216,'rsquo'=>8217,'sbquo'=>8218,'ldquo'=>8220,
'rdquo'=>8221,'bdquo'=>8222,'dagger'=>8224,'Dagger'=>8225,'permil'=>8240,'lsaquo'=>8249,
'rsaquo'=>8250,'euro'=>8364,'fnof'=>402,'Alpha'=>913,'Beta'=>914,'Gamma'=>915,
'Delta'=>916,'Epsilon'=>917,'Zeta'=>918,'Eta'=>919,'Theta'=>920,'Iota'=>921,
'Kappa'=>922,'Lambda'=>923,'Mu'=>924,'Nu'=>925,'Xi'=>926,'Omicron'=>927,
'Pi'=>928,'Rho'=>929,'Sigma'=>931,'Tau'=>932,'Upsilon'=>933,'Phi'=>934,
'Chi'=>935,'Psi'=>936,'Omega'=>937,'alpha'=>945,'beta'=>946,'gamma'=>947,
'delta'=>948,'epsilon'=>949,'zeta'=>950,'eta'=>951,'theta'=>952,'iota'=>953,
'kappa'=>954,'lambda'=>955,'mu'=>956,'nu'=>957,'xi'=>958,'omicron'=>959,
'pi'=>960,'rho'=>961,'sigmaf'=>962,'sigma'=>963,'tau'=>964,'upsilon'=>965,
'phi'=>966,'chi'=>967,'psi'=>968,'omega'=>969,'thetasym'=>977,'upsih'=>978,
'piv'=>982,'bull'=>8226,'hellip'=>8230,'prime'=>8242,'Prime'=>8243,'oline'=>8254,
'frasl'=>8260,'weierp'=>8472,'image'=>8465,'real'=>8476,'trade'=>8482,'alefsym'=>8501,
'larr'=>8592,'uarr'=>8593,'rarr'=>8594,'darr'=>8595,'harr'=>8596,'crarr'=>8629,
'lArr'=>8656,'uArr'=>8657,'rArr'=>8658,'dArr'=>8659,'hArr'=>8660,'forall'=>8704,
'part'=>8706,'exist'=>8707,'empty'=>8709,'nabla'=>8711,'isin'=>8712,'notin'=>8713,
'ni'=>8715,'prod'=>8719,'sum'=>8721,'minus'=>8722,'lowast'=>8727,'radic'=>8730,
'prop'=>8733,'infin'=>8734,'ang'=>8736,'and'=>8743,'or'=>8744,'cap'=>8745,
'cup'=>8746,'int'=>8747,'there4'=>8756,'sim'=>8764,'cong'=>8773,'asymp'=>8776,
'ne'=>8800,'equiv'=>8801,'le'=>8804,'ge'=>8805,'sub'=>8834,'sup'=>8835,
'nsub'=>8836,'sube'=>8838,'supe'=>8839,'oplus'=>8853,'otimes'=>8855,'perp'=>8869,
'sdot'=>8901,'lceil'=>8968,'rceil'=>8969,'lfloor'=>8970,'rfloor'=>8971,'lang'=>9001,
'rang'=>9002,'loz'=>9674,'spades'=>9824,'clubs'=>9827,'hearts'=>9829,'diams'=>9830,
'nbsp'=>160,'iexcl'=>161,'cent'=>162,'pound'=>163,'curren'=>164,'yen'=>165,
'brvbar'=>166,'sect'=>167,'uml'=>168,'copy'=>169,'ordf'=>170,'laquo'=>171,
'not'=>172,'shy'=>173,'reg'=>174,'macr'=>175,'deg'=>176,'plusmn'=>177,
'sup2'=>178,'sup3'=>179,'acute'=>180,'micro'=>181,'para'=>182,'middot'=>183,
'cedil'=>184,'sup1'=>185,'ordm'=>186,'raquo'=>187,'frac14'=>188,'frac12'=>189,
'frac34'=>190,'iquest'=>191,'Agrave'=>192,'Aacute'=>193,'Acirc'=>194,'Atilde'=>195,
'Auml'=>196,'Aring'=>197,'AElig'=>198,'Ccedil'=>199,'Egrave'=>200,'Eacute'=>201,
'Ecirc'=>202,'Euml'=>203,'Igrave'=>204,'Iacute'=>205,'Icirc'=>206,'Iuml'=>207,
'ETH'=>208,'Ntilde'=>209,'Ograve'=>210,'Oacute'=>211,'Ocirc'=>212,'Otilde'=>213,
'Ouml'=>214,'times'=>215,'Oslash'=>216,'Ugrave'=>217,'Uacute'=>218,'Ucirc'=>219,
'Uuml'=>220,'Yacute'=>221,'THORN'=>222,'szlig'=>223,'agrave'=>224,'aacute'=>225,
'acirc'=>226,'atilde'=>227,'auml'=>228,'aring'=>229,'aelig'=>230,'ccedil'=>231,
'egrave'=>232,'eacute'=>233,'ecirc'=>234,'euml'=>235,'igrave'=>236,'iacute'=>237,
'icirc'=>238,'iuml'=>239,'eth'=>240,'ntilde'=>241,'ograve'=>242,'oacute'=>243,
'ocirc'=>244,'otilde'=>245,'ouml'=>246,'divide'=>247,'oslash'=>248,'ugrave'=>249,
'uacute'=>250,'ucirc'=>251,'uuml'=>252,'yacute'=>253,'thorn'=>254,'yuml'=>255
) as $alpha=>$num)
$news=str_replace("&$alpha;", "&#$num;", $news);

You can do fancier things with preg_replace and array_map but this is the data you'll need.
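As one possible shortcut, PHP can supply that table itself through get_html_translation_table(). The sketch below differs from the loop above in one respect: it replaces named entities with the literal UTF-8 characters rather than with numeric references, which is equally valid in a UTF-8 XML document. The function name named_entities_to_utf8() is made up for this example.

```php
<?php
// Hypothetical helper: replace HTML named entities with literal UTF-8 characters,
// leaving the entities that XML itself understands untouched.
function named_entities_to_utf8(string $text): string
{
    // character => entity, for example "×" => "&times;"
    $table = get_html_translation_table(HTML_ENTITIES, ENT_QUOTES | ENT_HTML401, 'UTF-8');

    // entity => character
    $map = array_flip($table);

    // Keep the XML-safe escapes as they are.
    foreach (['&amp;', '&lt;', '&gt;', '&quot;', '&apos;', '&#039;'] as $keep) {
        unset($map[$keep]);
    }

    // strtr() replaces each entity once and never rescans its own output.
    return strtr($text, $map);
}

$converted = named_entities_to_utf8('2 &times; 3 &lt; 7 &amp; caf&eacute;');
echo $converted, PHP_EOL;
```

Because &amp; and &lt; are left alone, the result can still be placed into XML markup without turning text into tags.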

Alternatively, if performance is an issue for you, you can do some fancy multi-byte-character detection and bypass the named entity step altogether. There are plenty of examples on the PHP website.
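One way to read "bypass the named entity step", assuming the mbstring extension is available, is to escape the XML-special characters and then convert every non-ASCII character straight to a numeric reference with mb_encode_numericentity(). The helper name to_numeric_entities() and the sample string are illustrative only.

```php
<?php
// Hypothetical helper: escape XML-special characters, then turn every non-ASCII
// character into a numeric character reference. Requires the mbstring extension.
function to_numeric_entities(string $text): string
{
    // Escape &, <, > and " first so they stay literal text in the XML.
    $escaped = htmlspecialchars($text, ENT_COMPAT, 'UTF-8');

    // Conversion map: [first code point, last code point, offset, mask].
    $convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];

    return mb_encode_numericentity($escaped, $convmap, 'UTF-8');
}

// Sample input: a multiplication sign, an ampersand and the Hebrew letter bet.
$encoded = to_numeric_entities("2 \xC3\x97 3 & \xD7\x91");
echo $encoded, PHP_EOL;
```

The output is pure ASCII, so it survives even if a later step mishandles the file's encoding.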

Strictly speaking, if you've marked your XML document as UTF-8 encoded, you can leave the entity encoding out completely and just escape the four main characters:

$table = array('&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;');
$news = str_replace(array_keys($table), array_values($table), $_POST["news"]);
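A related option, not part of the answer above, is to skip manual escaping and hand the raw string to DOMDocument::createTextNode(), so that the DOM serializer escapes it when saving. The sketch below assumes the dom extension is loaded; the element names mirror the question's file and the sample string is made up.

```php
<?php
// Sample input for illustration: Hebrew text plus characters that are special in XML.
$news = "\xD7\x91 & <b>";

$doc = new DOMDocument('1.0', 'UTF-8');
$root = $doc->appendChild($doc->createElement('news'));

// createTextNode() takes the raw string; & and < are escaped on output.
$entry = $doc->createElement('text');
$entry->appendChild($doc->createTextNode($news));
$root->appendChild($entry);

$xml = $doc->saveXML();

// Round trip: parse the serialized XML again and read the text back.
$check = new DOMDocument();
$check->loadXML($xml);
$roundTrip = $check->getElementsByTagName('text')->item(0)->textContent;

var_dump($roundTrip === $news);
```

With this approach neither htmlentities() nor htmlspecialchars() is needed before writing, since the string is never interpreted as markup.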

