
UTF-8, XML, and htmlentities with PHP / MySQL

I have found a lot of varying / inconsistent information across the web on this topic, so I'm hoping someone can help me out with these issues:

I need a function to cleanse a string so that it is safe to insert into a UTF-8 MySQL database or to write to a UTF-8 XML file. Characters that can't be converted to UTF-8 should be removed.

For writing to an XML file, I'm also running into the problem of converting HTML entities into numeric entities. htmlspecialchars() works almost all of the time, but I have read that it is not sufficient for properly cleansing every string, for example one that contains an invalid HTML entity.

Thanks for your help, Brian

You didn't say where the strings were coming from, but if you're getting them from an HTML form submission, see this article:

Setting the character encoding in form submit for Internet Explorer

The long and short of it: you need to tell the browser explicitly which charset you want the form submission in. If you specify UTF-8, a well-behaved browser should not send you invalid UTF-8. That does not protect you from malformed input sent deliberately or by a non-browser client, so if you want to guard against that as well, validate the input with iconv:

http://www.php.net/iconv

$utf_8_string = iconv($from_charset, $to_charset, $original_string);

If you specify "utf-8" as both $from_charset and $to_charset, iconv() should fail (it returns false and raises a notice) if $original_string contains invalid UTF-8.
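The check described above can be sketched as a small helper. This is only a sketch under stated assumptions: the function name isValidUtf8 is made up for illustration, and the code assumes an iconv implementation (such as glibc's) that rejects malformed input when converting UTF-8 to UTF-8.

```php
<?php
// Hypothetical helper (not from the original answer): reports whether a
// string is well-formed UTF-8 by round-tripping it through iconv().
function isValidUtf8(string $text): bool
{
    // iconv() raises a notice on bad input; @ silences it so that only the
    // return value is inspected. A return value of false means it failed.
    return @iconv('UTF-8', 'UTF-8', $text) !== false;
}

var_dump(isValidUtf8("caf\xC3\xA9")); // "café" encoded as UTF-8
var_dump(isValidUtf8("a\xFFb"));      // 0xFF can never appear in UTF-8
```

Where the mbstring extension is available, mb_check_encoding($text, 'UTF-8') performs the same kind of check without relying on iconv's error behaviour.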

If you're getting your strings from a different source and you know the character encoding, you can still use iconv(). Typical encodings in the US are CP-1252 (Windows) and ISO-8859-1 (everything else).
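As an illustration of converting from a known source encoding, here is a minimal sketch. The helper name toUtf8 is hypothetical, and the charset names 'CP1252' and 'ISO-8859-1' are the spellings commonly accepted by iconv; other builds may expect different aliases.

```php
<?php
// Hypothetical helper (not from the original answer): convert a string
// from a known source encoding to UTF-8, failing loudly on bad input.
function toUtf8(string $text, string $fromCharset): string
{
    // @ silences iconv's notice/warning; failure is detected via false.
    $converted = @iconv($fromCharset, 'UTF-8', $text);
    if ($converted === false) {
        throw new InvalidArgumentException(
            "Could not convert input from $fromCharset to UTF-8"
        );
    }
    return $converted;
}

// In CP-1252, bytes 0x93 and 0x94 are the curly double quotes.
$quoted = toUtf8("\x93quoted\x94", 'CP1252');

// In ISO-8859-1, byte 0xE9 is "é".
$cafe = toUtf8("caf\xE9", 'ISO-8859-1');
```

Picking the right source charset matters: the same bytes 0x93 and 0x94 are control characters in ISO-8859-1, so text from Windows sources converted as ISO-8859-1 tends to lose its curly quotes.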

Something like this?

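// Replace typographic ("smart") quotes with their plain ASCII equivalents.
// Assumes this source file and the input string use the same encoding.
function cleanse($in) {
    $bad  = array('”', '“', '’', '‘');
    $good = array('"', '"', "'", "'");
    return str_replace($bad, $good, $in);
}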

You can convert a string from any encoding to UTF-8 with iconv or mbstring:

// With the //IGNORE flag, iconv discards characters it cannot convert instead of failing
$utf8 = iconv('input-encoding', 'UTF-8//IGNORE', $the_string);

or

$utf8 = mb_convert_encoding($the_string, 'UTF-8', 'input-encoding');
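Putting the pieces together for the XML case in the question, one possible sketch follows. It is not a definitive implementation: the function name cleanseForXml is invented for illustration, the input is assumed to be intended as UTF-8 already, and the mbstring extension is assumed to be installed.

```php
<?php
// Hypothetical helper (not from the original answers): make an arbitrary
// string safe to embed as text in a UTF-8 XML document.
function cleanseForXml(string $text): string
{
    // 1. Drop byte sequences that are not valid UTF-8. By default mbstring
    //    substitutes "?" for them; 'none' makes it remove them instead.
    $previous = mb_substitute_character();
    mb_substitute_character('none');
    $text = mb_convert_encoding($text, 'UTF-8', 'UTF-8');
    mb_substitute_character($previous); // restore the earlier setting

    // 2. Remove code points that XML 1.0 does not allow, which covers
    //    most control characters. The cast guards against a null result.
    $text = (string) preg_replace(
        '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u',
        '',
        $text
    );

    // 3. Escape the XML markup characters (&, <, > and both quote types).
    return htmlspecialchars($text, ENT_XML1 | ENT_QUOTES, 'UTF-8');
}

echo cleanseForXml("Fish \xFF& <chips>\x00"), "\n";
```

Only step 1 is relevant when the target is a UTF-8 MySQL column; SQL escaping itself is better left to prepared statements than to a string-cleansing function.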
